metadata: While nodeSelector is like node affinity, it doesn't have the same "and/or" matchExpressions that affinity has. - name: ocp Karpenter will back off and retry over time. If so, a pod might not be scheduled. - name: ocp Specify an operator. Learn about scheduling workloads with Karpenter, requiredDuringSchedulingIgnoredDuringExecution, Well-Known Labels, Annotations and Taints, Specify a memory request and a memory limit. Once the cluster nodes are created and properly labeled by Pipeline, deployments are run with the specified constraints automatically on top of Kubernetes. The following steps demonstrate a simple two-pod configuration that creates a pod with a label and a pod that uses an anti-affinity preferred rule to attempt to prevent scheduling with that pod. In the case of requiredDuringSchedulingIgnoredDuringExecution, only kubernetes.io/hostname is accepted as a value for topologyKey. Specify the key and values that must be met. The first constraint says you could use us-west-2a or us-west-2b; the second constraint makes it so only us-west-2b can be used. requiredDuringSchedulingIgnoredDuringExecution: labels: For example: Adding this to your podspec would result in: The three topologyKey values that Karpenter supports are: See Pod Topology Spread Constraints for details. There are two types of pod affinity rules: required and preferred. We can distinguish between two different effects: In the example above we used my-taint=test:NoSchedule and we can see that the node has been tainted and, according to the NoSchedule effect, already running pods have not been touched. Needing to run in zones where dependent applications or storage are available, Requiring certain kinds of processors or other hardware, Wanting to use techniques like topology spread to help ensure high availability, Pods being spread across zones, hosts, and capacity-type, No more than one pod difference in the number of pods on each host. kind: Pod security: s1 Anti-affinity rules allow you to prevent pods of a particular service from scheduling on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service. apiVersion: v1
Conceptually speaking, the topology key is the domain for which the matching rules are applied. Karpenter automatically detects storage scheduling requirements and includes them in node launch decisions. The pod pod-s2 has the label selector security:s2. kind: Pod Pods may require nodes with special hardware, isolation, or colocation with other pods running in the system. values: containers: The pod pod-s2 is not scheduled unless there is a node with a pod that has the security:s2 label. Karpenter supports standard Kubernetes scheduling constraints.
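As a rough sketch, a pod-s2 manifest with such a required podAffinity term could look like this (the container name and hello-pod image mirror the other examples in this section):

apiVersion: v1
kind: Pod
metadata:
  name: pod-s2
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - s2
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-affinity
    image: docker.io/ocpqe/hello-pod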
Specifies a weight for a preferred rule. This leads to pods with label my-label: test being placed on different nodes. - name: pod-affinity containers: Set node affinity for kube-dns so it selects the node that has the test-node-affinity: test label: Notice requiredDuringSchedulingIgnoredDuringExecution, which tells the Kubernetes scheduler that: Note: requiredDuringSchedulingRequiredDuringExecution is not supported yet (as of Kubernetes 1.11); thus, if a label on a node changes, pods that don't match the new node label won't be evicted, but will continue to run on the node. In the above pod anti-affinity setting, the domain is defined by the kubernetes.io/hostname label of the nodes, i.e. the node where the pod runs; thus the labelSelector/matchExpressions is evaluated within the scope of a node. Or, you could spread the pods of a service across nodes or availability zones to reduce correlated failures. Karpenter is aware of label aliasing and translates this label into topology.kubernetes.io/zone in memory. - name: ocp Keep in mind when using both taints and node affinity that it is necessary to set them carefully to avoid these types of situations.
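As a rough sketch (assuming the target node has been labeled test-node-affinity=test), the node affinity stanza in the kube-dns Deployment's pod template could look like this:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: test-node-affinity   # custom label set on the target node
          operator: In
          values:
          - test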
The team4a pod is scheduled on the same node as the team4 pod. So if capacity becomes available, it will schedule the pod without user intervention. Examples below illustrate how to use node affinity to include (In) and exclude (NotIn) objects. The anti-affinity rule would cause it to avoid running on any node with a pod labeled app=inflate. When the pod is created, Karpenter follows references from the Pod to PersistentVolumeClaim to StorageClass and identifies that this pod requires storage in us-west-2a and us-west-2b. Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod. - key: security In more human-readable terms, a pod with the label my-label: test is only scheduled to node X if there is no other pod with the label my-label: test. The node with the highest weight is preferred. See Inter-pod affinity and anti-affinity in the Kubernetes documentation for details. If we wanted a specific node, then the appropriate node affinity setting should have been placed onto the pod as well. When creating other pods, edit the Pod spec to set the following parameters: Use the podAntiAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter: Specify a weight for the node, 1-100. See Managing Resources for Containers for details on resource types supported by Kubernetes, Specify a memory request and a memory limit for examples of memory requests, and Provisioning Configuration for a list of supported resources. Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain. Distributing instances of the same pod to different nodes has advantages but may have drawbacks as well. The Pipeline platform allows users to express their constraints in terms of resources (CPU, memory, network, IO, etc.). If you want to create a custom label, you should do that at the provisioner level. While the Kubernetes scheduler may try to distribute the replicas over multiple nodes, this is not guaranteed.
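As a minimal sketch of such a preferred podAntiAffinity term (reusing the app=inflate label mentioned above; the weight of 100 is illustrative):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                      # 1-100; higher weights are preferred
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: inflate
        topologyKey: kubernetes.io/hostname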
metadata: users can't be stopped from deploying pods that tolerate the wrong taint; thus, besides system pods, pods other than the desired ones may still run on the reserved nodes. metadata: If this anti-affinity term were on a deployment pod spec along with a matching app=inflate label, it would prevent more than one pod from the deployment from running on any single node. containers: In the following example, the StorageClass defines zonal topologies for us-west-2a and us-west-2b and binding mode WaitForFirstConsumer. For example, if the provisioner sets limits that allow only a particular zone to be used, and a pod asks for a different zone, it will not be scheduled. The operator can be In, NotIn, Exists, or DoesNotExist. The pod team4a has the label selector team:4 under podAffinity. You can think of these concepts as required and preferred, since Kubernetes never implemented other variants of these rules. It randomly selects us-west-2a, provisions a node in that zone, and binds the pod to the node. To prevent this situation, carefully configure pod affinity with equal-priority pods. metadata: spec: Learn more about moving to CSI providers. This is expected, as the toleration allows the pod to be scheduled to a tainted node (it tolerates it) but doesn't necessarily mean that the pod will actually be scheduled there. (A more relevant use case would be the running of pods on a distributed cache that should be co-located with pods using the cache). The key and value (label) that must be matched to apply the rule. As nodeAffinity encompasses what can be achieved with nodeSelectors, nodeSelectors will be deprecated in Kubernetes; thus we discuss nodeAffinity here. If you want the new pod to be scheduled with the other pod, use the same key and value parameters as the label on the first pod. See Node affinity for details. The following diagram illustrates the pod affinity flow: Kubernetes provides building blocks to deal with various special scenarios with regard to deploying and running application components/services. operator: In Besides the requiredDuringSchedulingIgnoredDuringExecution type of node affinity there exists preferredDuringSchedulingIgnoredDuringExecution. This allows you to define a single set of rules that apply to both existing and provisioned capacity. Pod anti-affinity can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod. The format of a taint is <key>=<value>:<effect> (for example, my-taint=test:NoSchedule).
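A sketch of such a zonally restricted StorageClass for the EBS CSI driver (the name ebs-zonal is hypothetical; note the topology.ebs.csi.aws.com/zone key discussed later in this section):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-zonal                         # hypothetical name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer   # volume is bound only once a pod is scheduled
allowedTopologies:
- matchLabelExpressions:
  - key: topology.ebs.csi.aws.com/zone    # the EBS CSI driver's zone label
    values:
    - us-west-2a
    - us-west-2b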
By using the Kubernetes topologySpreadConstraints you can ask the provisioner to have pods push away from each other to limit the blast radius of an outage. We expect to see the kube-dns pod evicted and aws-node and kube-proxy to stay, as these are daemonset system pods. If they all fail, Karpenter will fail to provision the pod. Using this Kubernetes feature we can create nodes that are reserved (dedicated) for specific pods. Pod anti-affinity helps with this. name: security-s1 image: docker.io/ocpqe/hello-pod, apiVersion: v1 name: pod-s1 Specify a key and value for the label. labels: spec: This can include well-known labels or custom labels you create yourself. team: "4" name: pod-s1
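A minimal sketch of a topologySpreadConstraints stanza that spreads pods across hosts and zones (the app: inflate selector is illustrative; maxSkew: 1 means no more than one pod difference between domains):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname       # spread across hosts
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: inflate                          # illustrative label
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone  # spread across zones
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: inflate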
All examples below assume that the provisioner doesn't have constraints to prevent those zones from being used. Pod anti-affinity requires topologyKey to be set and all pods to have labels referenced by topologyKey. Then the pod can declare that custom label. image: docker.io/ocpqe/hello-pod, apiVersion: v1 With nodeSelector you can ask for a node that matches selected key-value pairs. containers: For example, use the operator In to require the label to be on the node. Here is an example of a nodeSelector for selecting nodes: This example features a well-known label (topology.kubernetes.io/zone) and a label that is well known to Karpenter (karpenter.sh/capacity-type).
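For instance, a nodeSelector using those two labels could look like this (the zone and capacity-type values are illustrative):

nodeSelector:
  topology.kubernetes.io/zone: us-west-2a   # well-known Kubernetes label
  karpenter.sh/capacity-type: spot          # label well known to Karpenter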
If this is not the desired outcome, then instead of the requiredDuringSchedulingIgnoredDuringExecution hard rule, the preferredDuringSchedulingIgnoredDuringExecution soft rule should be used.
We can conclude that taints and tolerations are better used in those cases where we want to keep pods away from nodes, except for a few select nodes. - name: security-s1 spec: The EBS CSI driver uses topology.ebs.csi.aws.com/zone instead of the standard topology.kubernetes.io/zone label. If you want the new pod to not be scheduled with the other pod, use the same key and value parameters as the label on the first pod. Once we bounce our pod we should see it being scheduled to node ip-192-168-101-21.us-west-2.compute.internal, since it matches by node affinity and node selector expression, and because the pod tolerates the taints of the node. Taints are the opposite of affinity. By using the podAffinity and podAntiAffinity configuration on a pod spec, you can inform the provisioner of your desire for pods to schedule together or apart with respect to different topology domains. In the next post we will describe the features that Pipeline provides to our users and how these rely on taints and tolerations, node affinity and pod affinity/anti-affinity, so stay tuned.
The following example demonstrates pod anti-affinity for pods with matching labels and label selectors. With node affinity we can tell Kubernetes which nodes to schedule a pod to, using the labels on each node. For example, using affinity rules, you could spread or pack pods within a service or relative to pods in other services. When configuring a StorageClass for the EBS CSI Driver, you must use topology.ebs.csi.aws.com/zone. In a follow-up post we will go into the details of how the Pipeline platform uses these and allows use of the underlying infrastructure in an efficient, automated way. podAffinity: containers: The pod anti-affinity rule says that the pod prefers not to schedule onto a node if that node is already running a pod with a label having key security and value S2. Reasons for constraining where your pods run could include: Your Cloud Provider defines the first layer of constraints, including all instance types, architectures, zones, and purchase types available to its cloud. So all key-value pairs must match if you use nodeSelector. Setting a taint on a node tells the scheduler to not run a pod on it unless the pod has explicitly said it can tolerate that taint. See nodeSelector in the Kubernetes documentation for details. matchExpressions: The question of which node is up to the Kubernetes scheduler (in this case it's ip-192-168-165-61.us-west-2.compute.internal).
This is by design, as system pods are required by the Kubernetes infrastructure (e.g. the aws-node and kube-proxy daemonsets). If there is no other pod with that label, the new pod remains in a pending state: requiredDuringSchedulingIgnoredDuringExecution, preferredDuringSchedulingIgnoredDuringExecution, apiVersion: v1 Pods which require that most of the resources of the node be available to them in order to operate flawlessly should be scheduled to nodes that are reserved for them. The node with the highest weight is preferred. A taint is a property of a node that prevents a pod from being scheduled on that node. You configure pod affinity/anti-affinity through the Pod spec files.
kind: Pod The following example demonstrates pod affinity for pods without matching labels and label selectors. What if the kube-dns pod does not tolerate the taint on node ip-192-168-101-21.us-west-2.compute.internal? The cluster administrator adds the next layer of constraints by creating one or more provisioners. labels: These applications usually have needs which require special scheduling constraints. The pod pod-s2 has the label selector security:s1 under podAntiAffinity. Here, if us-west-2a is not available, the second term will cause the pod to run on a spot instance in us-west-2d. The CSI driver creates a PersistentVolume according to the PersistentVolumeClaim and gives it a node affinity rule for us-west-2a. Instance type selection math only uses requests, but limits may be configured to enable resource oversubscription. name: pod-s2 kind: Pod In OKD, pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods. We set the label my-label: test on the pod, which will be used to find pods, by label, within the domain defined by topologyKey. Preferred rules specify that, if the rule can be met, the scheduler tries to enforce it, but does not guarantee enforcement. (e.g. the kubernetes.io/hostname label is set on each node by Kubernetes). security: s1
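A rough sketch of the pod template fragment for this kind of spreading (the my-label: test label and the hostname topologyKey come from the description above):

metadata:
  labels:
    my-label: test
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: my-label
            operator: In
            values:
            - test
        topologyKey: kubernetes.io/hostname   # the domain is a single node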
The following diagram illustrates pod anti-affinity flow: Pod affinity is similar to pod anti-affinity, with the difference that topologyKey is not limited to kubernetes.io/hostname; it can be any label that is consistently placed on all nodes. - labelSelector: affinity: image: docker.io/ocpqe/hello-pod, cat pod-s1.yaml The following examples demonstrate pod affinity and pod anti-affinity. metadata: The final layer comes from you adding specifications to your Kubernetes pod deployments. The pod pod-s2 cannot be scheduled on the same node as pod-s1. name: team4 In this example, the pod affinity rule indicates that the pod can schedule onto a node only if that node has at least one already-running pod with a label that has the key security and value S1. For a preferred rule, specify a weight, 1-100. Pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods. Required rules must be met before a pod can be scheduled on a node. Within a Pod spec, you can both make requests and set limits on resources a pod needs, such as CPU and memory. There are two normal pods, kube-dns-7cc87d595-wbs7x and tiller-deploy-777677b45c-m9n27, the former running on node ip-192-168-101-21.us-west-2.compute.internal and the latter on ip-192-168-96-47.us-west-2.compute.internal. However, if Karpenter fails to provision on the first nodeSelectorTerms, it will try again using the second one. The first can be thought of as a hard rule, while the second constitutes a soft rule that Kubernetes tries to enforce but will not guarantee. Description of the pod label that determines when the anti-affinity rule applies. Well, the pod will remain in a Pending state, since node affinity makes the Kubernetes scheduler try to place it onto a node that rejects the pod being scheduled. We can see that the kube-dns pod was stopped and started on a different node, ip-192-168-165-61.us-west-2.compute.internal: Now, if we want to make the kube-dns pod schedulable on the tainted ip-192-168-101-21.us-west-2.compute.internal node, we need to place the appropriate toleration on the pod. The following example shows a Pod spec configured for pod affinity and anti-affinity. These Kubernetes features are useful in scenarios like: an application that consists of multiple services, some of which may require that they be co-located on the same node for performance reasons; replicas of critical services shouldn't be placed onto the same node to avoid loss in the event of node failure. spec: requiredDuringSchedulingIgnoredDuringExecution: kube-dns-669db795bb-5blv2 3/3 Running, kube-dns-55ccbc9fc-8xjfg 3/3 Running, kube-dns-55ccbc9fc-ms577 3/3 Running, kube-dns-85945db57c-kk288 3/3 Running, kube-dns-85945db57c-pzw2b 3/3 Running. image: docker.io/ocpqe/hello-pod, apiVersion: v1 If your pods have no requirements for how or where to run, you can let Karpenter choose nodes from the full range of available cloud provider resources. kube-system aws-node-vfkxn 10m (0%), kube-system kube-dns-7cc87d595-wbs7x 260m (6%), kube-system kube-proxy-z8hkv 100m (2%), "ip-192-168-101-21.us-west-2.compute.internal", Taints: my-taint=test:NoSchedule, Taints: my-taint=test:NoExecute, kube-system aws-node-vfkxn 10m (0%), kube-system kube-proxy-z8hkv 100m (2%), kube-system kube-dns-7cc87d595-cbsxg 3/3 Running, kube-dns-6848d77f98-vvkdq 3/3 Running. The operator represents the relationship between the label on the existing pod and the set of values in the matchExpressions in the new pod's specification.
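A minimal sketch of such a resources stanza inside a container spec (all values are illustrative):

resources:
  requests:
    memory: "128Mi"   # requests are what the scheduler and Karpenter use for sizing
    cpu: "500m"
  limits:
    memory: "256Mi"   # limits may exceed requests to allow oversubscription
    cpu: "1"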
The pod pod-s1 has the label security:s1. In practice tainted nodes will be more like pseudo-reserved nodes, since taints and tolerations won't exclude undesired pods in certain circumstances: I've set up a 3-node EKS cluster with Pipeline. When setting rules, the following node affinity types define how hard or soft each rule is: The IgnoredDuringExecution part of each tells the pod to keep running, even if conditions change on the node so the rules no longer match. For example, if there are not enough eligible nodes or available resources, not all desired replicas of the pod can be scheduled, thus consigning them to pending status. This Kubernetes feature allows users to mark a node (taint the node) so that no pods can be scheduled to it, unless a pod explicitly tolerates the taint. If labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule. Create a pod with a specific label in the Pod spec: When creating other pods, edit the Pod spec as follows: Use the podAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter: Specify the key and value that must be met. Later on, the pod is deleted and a new pod is created that requests the same claim. The following diagram illustrates pod node affinity flow: Pod affinity and anti-affinity allow placing pods on nodes as a function of the labels of other pods. The following steps demonstrate a simple two-pod configuration that creates a pod with a label and a pod that uses affinity to allow scheduling with that pod. Anti-affinity is a property of pods. Let's taint node ip-192-168-101-21.us-west-2.compute.internal, which hosts the kube-dns-7cc87d595-wbs7x pod and the daemonset system pods. The following diagram illustrates the flow of taints and tolerations: In order to get the kube-dns pod scheduled to a specific node (in our case ip-192-168-101-21.us-west-2.compute.internal) we need to delve into our next topic: node affinity. These requirements are turned into infrastructure specifications using Telescopes. topologyKey: kubernetes.io/hostname Its limits are set to 256MiB of memory and 1 CPU. There are two daemonset system pods: aws-node and kube-proxy, running on every single node. Think of it as the Kubernetes evolution for pod affinity: it lets you relate pods with respect to nodes while still allowing spread. kind: Pod This time, Karpenter identifies that a PersistentVolume already exists for the PersistentVolumeClaim, and includes its zone us-west-2a in the pod's scheduling requirements. spec: System pods are created with toleration settings that tolerate all taints and thus can be scheduled onto any node (see the sketch below). Enterprises often use multi-tenant and heterogeneous clusters to deploy their applications to Kubernetes.
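As a minimal sketch, a toleration that matches every taint can omit the key and use the Exists operator:

tolerations:
- operator: Exists   # with no key given, this tolerates all taints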
In general, Karpenter will go through each of the nodeSelectorTerms in order and take the first one that works. Changing the second operator to NotIn would allow the pod to run in us-west-2a only: Continuing to add to the example, nodeAffinity lets you define terms so that if one term doesn't work, it goes to the next one. We want to have multiple replicas of the kube-dns pod running while distributed across different nodes. Also, nodeSelector can only do inclusions, while affinity can do inclusions and exclusions (In and NotIn). To get pods scheduled to specific nodes, Kubernetes provides nodeSelectors and nodeAffinity. In this post we discuss how taints and tolerations, node affinity, and pod affinity/anti-affinity work and can be used to instruct the Kubernetes scheduler to place pods on nodes that fulfill their special needs. Since the kube-dns pod is created through a deployment, we are going to place the following toleration into the deployment's spec (see the sketch after this paragraph): As we can see, the kube-dns pod is still running on node ip-192-168-165-61.us-west-2.compute.internal instead of the tainted ip-192-168-101-21.us-west-2.compute.internal even though we set the appropriate toleration for it. - s2 This example shows a Provisioner that was set up with a taint for only running pods that require a GPU, such as the following: For a pod to request to run on a node from that provisioner, it could set a toleration as follows: See Taints and Tolerations in the Kubernetes documentation for details. Now let's taint the same node with the NoExecute effect. However, by taking advantage of Karpenter's model of layered constraints, you can be sure that the precise type and amount of resources needed are available to your pods. labels: image: docker.io/ocpqe/hello-pod, NAME READY STATUS RESTARTS AGE IP NODE, pod-s2 0/1 Pending 0 32s
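A minimal sketch of the toleration for the my-taint=test:NoSchedule taint used earlier, placed under the Deployment's pod template spec:

tolerations:
- key: my-taint
  operator: Equal
  value: test
  effect: NoSchedule   # an analogous entry would be needed for the NoExecute effect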
