The default OpenShift Container Platform pod scheduler is responsible for determining placement of new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone/pluggable solution. It does not modify the pod and just creates a binding for the pod that ties the pod to the particular node.

A selection of predicates and priorities defines the policy for the scheduler. See Modifying scheduler policy for a list of predicates and priorities.

Sample default scheduler object
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: 2019-05-20T15:39:01Z
  generation: 1
  name: cluster
  resourceVersion: "1491"
  selfLink: /apis/config.openshift.io/v1/schedulers/cluster
  uid: 6435dd99-7b15-11e9-bd48-0aec821b8e34
spec:
  policy: (1)
    name: scheduler-policy
  defaultNodeSelector: type=user-node,region=east (2)
1 You can specify the name of a custom scheduler policy file.
2 Optionally, specify a default node selector to restrict pod placement to specific nodes.

Understanding default scheduling

The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:

Filters the Nodes

The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates.

Prioritize the Filtered List of Nodes

This is achieved by passing each node through a series of priority_ functions that assign it a score between 0 - 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple weight (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by the weight (default weight for most priorities is 1) and then combined by adding the scores for each node provided by all the priorities. This weight attribute can be used by administrators to give higher importance to some priorities.

Select the Best Fit Node

The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.

Understanding Scheduler Policy

The selection of the predicate and priorities defines the policy for the scheduler.

The scheduler configuration file is a JSON file, which must be named policy.cfg, that specifies the predicates and priorities the scheduler will consider.

In the absence of the scheduler policy file, the default scheduler behavior is used.

The predicates and priorities defined in the scheduler configuration file completely override the default scheduler policy. If any of the default predicates and priorities are required, you must explicitly specify the functions in the policy configuration.

Sample scheduler ConfigMap
apiVersion: v1
data:
  policy.cfg: |
    {
        "kind" : "Policy",
        "apiVersion" : "v1",
        "predicates" : [
                {"name" : "MaxGCEPDVolumeCount"},
                {"name" : "GeneralPredicates"},
                {"name" : "MaxAzureDiskVolumeCount"},
                {"name" : "MaxCSIVolumeCountPred"},
                {"name" : "CheckVolumeBinding"},
                {"name" : "MaxEBSVolumeCount"},
                {"name" : "PodFitsResources"},
                {"name" : "MatchInterPodAffinity"},
                {"name" : "CheckNodeUnschedulable"},
                {"name" : "NoDiskConflict"},
                {"name" : "NoVolumeZoneConflict"},
                {"name" : "MatchNodeSelector"},
                {"name" : "HostName"},
                {"name" : "PodToleratesNodeTaints"}
                ],
        "priorities" : [
                {"name" : "LeastRequestedPriority", "weight" : 1},
                {"name" : "BalancedResourceAllocation", "weight" : 1},
                {"name" : "ServiceSpreadingPriority", "weight" : 1},
                {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
                {"name" : "NodeAffinityPriority", "weight" : 1},
                {"name" : "TaintTolerationPriority", "weight" : 1},
                {"name" : "ImageLocalityPriority", "weight" : 1},
                {"name" : "SelectorSpreadPriority", "weight" : 1},
                {"name" : "InterPodAffinityPriority", "weight" : 1},
                {"name" : "EqualPriority", "weight" : 1}
                ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-09-17T08:42:33Z"
  name: scheduler-policy
  namespace: openshift-config
  resourceVersion: "59500"
  selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy
  uid: 17ee8865-d927-11e9-b213-02d1e1709840`

Creating a scheduler policy file

You can control change the default scheduling behavior by creating a JSON file with using the with the desired predicates and priorities. You then generate a ConfigMap from the JSON file and point the cluster Scheduler object to use the ConfigMap.

Procedure

To configure the scheduler policy:

  1. Create the a JSON file named policy.cfg with the desired predicates and priorities.

    Sample scheduler JSON file
    {
    "kind" : "Policy",
    "apiVersion" : "v1",
    "predicates" : [ (1)
            {"name" : "PodFitsHostPorts"},
            {"name" : "PodFitsResources"},
            {"name" : "NoDiskConflict"},
            {"name" : "NoVolumeZoneConflict"},
            {"name" : "MatchNodeSelector"},
            {"name" : "MaxEBSVolumeCount"},
            {"name" : "MaxAzureDiskVolumeCount"},
            {"name" : "checkServiceAffinity"},
            {"name" : "PodToleratesNodeNoExecuteTaints"},
            {"name" : "MaxGCEPDVolumeCount"},
            {"name" : "MatchInterPodAffinity"},
            {"name" : "PodToleratesNodeTaints"},
            {"name" : "HostName"}
            ],
    "priorities" : [(2)
            {"name" : "LeastRequestedPriority", "weight" : 1},
            {"name" : "BalancedResourceAllocation", "weight" : 1},
            {"name" : "ServiceSpreadingPriority", "weight" : 1},
            {"name" : "EqualPriority", "weight" : 1}
            ]
    }
    1 Add the predicates as needed.
    2 Add the priorities as needed.
  2. Create a ConfigMap based on the scheduler JSON file:

    $ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name> (1)
    1 Enter a name for the ConfigMap.

    For example:

    $ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
    
    configmap/scheduler-policy created
    apiVersion: v1
    data:
      policy.cfg: |
        {
            "kind" : "Policy",
            "apiVersion" : "v1",
            "predicates" : [
                    {"name" : "MaxGCEPDVolumeCount"},
                    {"name" : "GeneralPredicates"},
                    {"name" : "MaxAzureDiskVolumeCount"},
                    {"name" : "MaxCSIVolumeCountPred"},
                    {"name" : "CheckVolumeBinding"},
                    {"name" : "MaxEBSVolumeCount"},
                    {"name" : "PodFitsResources"},
                    {"name" : "MatchInterPodAffinity"},
                    {"name" : "CheckNodeUnschedulable"},
                    {"name" : "NoDiskConflict"},
                    {"name" : "NoVolumeZoneConflict"},
                    {"name" : "MatchNodeSelector"},
                    {"name" : "HostName"},
                    {"name" : "PodToleratesNodeTaints"}
                    ],
            "priorities" : [
                    {"name" : "LeastRequestedPriority", "weight" : 1},
                    {"name" : "BalancedResourceAllocation", "weight" : 1},
                    {"name" : "ServiceSpreadingPriority", "weight" : 1},
                    {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
                    {"name" : "NodeAffinityPriority", "weight" : 1},
                    {"name" : "TaintTolerationPriority", "weight" : 1},
                    {"name" : "ImageLocalityPriority", "weight" : 1},
                    {"name" : "SelectorSpreadPriority", "weight" : 1},
                    {"name" : "InterPodAffinityPriority", "weight" : 1},
                    {"name" : "EqualPriority", "weight" : 1}
                    ]
        }
    kind: ConfigMap
    metadata:
      creationTimestamp: "2019-09-17T08:42:33Z"
      name: scheduler-policy
      namespace: openshift-config
      resourceVersion: "59500"
      selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy
      uid: 17ee8865-d927-11e9-b213-02d1e1709840`
  3. Edit the Scheduler Operator Custom Resource to add the ConfigMap:

    $ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"<configmap-name>"}}}' --type=merge (1)
    1 Specify the name of the ConfigMap.

    For example:

    $ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}' --type=merge

    After making the change to the Scheduler config resource, wait for the opensift-kube-apiserver pods to redeploy. This can take several minutes. Until the pods redeploy, new scheduler does not take effect.

  4. Verify the scheduler policy is configured by viewing the log of a scheduler pod in the openshift-kube-scheduler namespace. The following command checks for the predoicates and priorites that are being registered by the scheduler:

    $ oc logs <scheduler-pod> | grep predicates

    For example:

    $ oc logs openshift-kube-scheduler-ip-10-0-141-29.ec2.internal | grep predicates
    
    Creating scheduler with fit predicates 'map[MaxGCEPDVolumeCount:{} MaxAzureDiskVolumeCount:{} CheckNodeUnschedulable:{} NoDiskConflict:{} NoVolumeZoneConflict:{} MatchNodeSelector:{} GeneralPredicates:{} MaxCSIVolumeCountPred:{} CheckVolumeBinding:{} MaxEBSVolumeCount:{} PodFitsResources:{} MatchInterPodAffinity:{} HostName:{} PodToleratesNodeTaints:{}]' and priority functions 'map[InterPodAffinityPriority:{} LeastRequestedPriority:{} ServiceSpreadingPriority:{} ImageLocalityPriority:{} SelectorSpreadPriority:{} EqualPriority:{} BalancedResourceAllocation:{} NodePreferAvoidPodsPriority:{} NodeAffinityPriority:{} TaintTolerationPriority:{}]'

Modifying scheduler policies

You change scheduling behavior by creating or editing your scheduler policy ConfigMap in the openshift-config project. Add and remove predicates and priorities to the ConfigMap to create a scheduler policy.

Procedure

To modify the current custom schedluling:

  • Edit the scheduler policy ConfigMap:

    $ oc edit configmap <configmap-name>  -n openshift-config

    For example:

    $ oc edit configmap scheduler-policy -n openshift-config
    
    apiVersion: v1
    data:
      policy.cfg: |
        {
            "kind" : "Policy",
            "apiVersion" : "v1",
            "predicates" : [ (1)
                    {"name" : "MaxGCEPDVolumeCount"},
                    {"name" : "GeneralPredicates"},
                    {"name" : "MaxAzureDiskVolumeCount"},
                    {"name" : "MaxCSIVolumeCountPred"},
                    {"name" : "CheckVolumeBinding"},
                    {"name" : "MaxEBSVolumeCount"},
                    {"name" : "PodFitsResources"},
                    {"name" : "MatchInterPodAffinity"},
                    {"name" : "CheckNodeUnschedulable"},
                    {"name" : "NoDiskConflict"},
                    {"name" : "NoVolumeZoneConflict"},
                    {"name" : "MatchNodeSelector"},
                    {"name" : "HostName"},
                    {"name" : "PodToleratesNodeTaints"}
                    ],
            "priorities" : [ (2)
                    {"name" : "LeastRequestedPriority", "weight" : 1},
                    {"name" : "BalancedResourceAllocation", "weight" : 1},
                    {"name" : "ServiceSpreadingPriority", "weight" : 1},
                    {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
                    {"name" : "NodeAffinityPriority", "weight" : 1},
                    {"name" : "TaintTolerationPriority", "weight" : 1},
                    {"name" : "ImageLocalityPriority", "weight" : 1},
                    {"name" : "SelectorSpreadPriority", "weight" : 1},
                    {"name" : "InterPodAffinityPriority", "weight" : 1},
                    {"name" : "EqualPriority", "weight" : 1}
                    ]
        }
    kind: ConfigMap
    metadata:
      creationTimestamp: "2019-09-17T17:44:19Z"
      name: policy-configmap
      namespace: openshift-kube-scheduler
      resourceVersion: "15370"
      selfLink: /api/v1/namespaces/openshift-kube-scheduler/configmaps/policy-configmap
    1 Add or remove predicates as needed.
    2 Add, remove, or change the weight of predicates as needed.

    It can take a few minutes for the scheduler to restart the pods with the updated policy.

  • Change the policies and predicates being used:

    1. Remove the scheduler policy CongifMap:

      $ oc delete configmap -n openshift-config <name>

      For example:

      $ oc delete configmap -n openshift-config  scheduler-policy
    2. Edit the policy.cfg file to add and remove policies and predicates as needed.

      For example:

      $ vi policy.cfg
      apiVersion: v1
      data:
        policy.cfg: |
          {
          "kind" : "Policy",
          "apiVersion" : "v1",
          "predicates" : [
                  {"name" : "PodFitsHostPorts"},
                  {"name" : "PodFitsResources"},
                  {"name" : "NoDiskConflict"},
                  {"name" : "NoVolumeZoneConflict"},
                  {"name" : "MatchNodeSelector"},
                  {"name" : "MaxEBSVolumeCount"},
                  {"name" : "MaxAzureDiskVolumeCount"},
                  {"name" : "CheckVolumeBinding"},
                  {"name" : "CheckServiceAffinity"},
                  {"name" : "PodToleratesNodeNoExecuteTaints"},
                  {"name" : "MaxGCEPDVolumeCount"},
                  {"name" : "MatchInterPodAffinity"},
                  {"name" : "PodToleratesNodeTaints"},
                  {"name" : "HostName"}
                  ],
          "priorities" : [
                  {"name" : "LeastRequestedPriority", "weight" : 2},
                  {"name" : "BalancedResourceAllocation", "weight" : 2},
                  {"name" : "ServiceSpreadingPriority", "weight" : 2},
                  {"name" : "EqualPriority", "weight" : 2}
                  ]
          }
    3. Re-create the scheduler policy ConfigMap based on the scheduler JSON file:

      $ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name> (1)
      1 Enter a name for the ConfigMap.

      For example:

      $ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
      
      configmap/scheduler-policy created

Understanding the scheduler predicates

Predicates are rules that filter out unqualified nodes.

There are several predicates provided by default in OpenShift Container Platform. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes.

Static Predicates

These predicates do not take any configuration parameters or inputs from the user. These are specified in the scheduler configuration using their exact name.

Default Predicates

The default scheduler policy includes the following predicates:

NoVolumeZoneConflict checks that the volumes a pod requests are available in the zone.

{"name" : "NoVolumeZoneConflict"}

MaxEBSVolumeCount checks the maximum number of volumes that can be attached to an AWS instance.

{"name" : "MaxEBSVolumeCount"}

MaxAzureDiskVolumeCount checks the maximum number of Azure Disk Volumes.

{"name" : "MaxAzureDiskVolumeCount"}

PodToleratesNodeTaints checks if a pod can tolerate the node taints.

{"name" : "PodToleratesNodeTaints"}

CheckNodeUnschedulable checks if a pod can be scheduled on a node with Unschedulable spec.

{"name" : "CheckNodeUnschedulable"}

CheckVolumeBinding evaluates if a pod can fit based on the volumes, it requests, for both bound and unbound PVCs. * For PVCs that are bound, the predicate checks that the corresponding PV’s node affinity is satisfied by the given node. * For PVCs that are unbound, the predicate searched for available PVs that can satisfy the PVC requirements and that the PV node affinity is satisfied by the given node.

The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.

{"name" : "CheckVolumeBinding"}

NoDiskConflict checks if the volume requested by a pod is available.

{"name" : "NoDiskConflict"}

MaxGCEPDVolumeCount checks the maximum number of Google Compute Engine (GCE) Persistent Disks (PD).

{"name" : "MaxGCEPDVolumeCount"}

MaxCSIVolumeCountPred

MatchInterPodAffinity checks if the pod affinity/anti-affinity rules permit the pod.

{"name" : "MatchInterPodAffinity"}
Other Static Predicates

OpenShift Container Platform also supports the following predicates:

The CheckNode-* predicates cannot be used if the Taint Nodes By Condition feature is enabled. The Taint Nodes By Condition feature is enabled by default.

CheckNodeMemoryPressure checks if a pod can be scheduled on a node with a memory pressure condition.

{"name" : "CheckNodeMemoryPressure"}

CheckNodeDiskPressure checks if a pod can be scheduled on a node with a disk pressure condition.

{"name" : "CheckNodeDiskPressure"}

CheckNodeCondition checks if a pod can be scheduled on a node reporting out of disk, network unavailable, or not ready conditions.

{"name" : "CheckNodeCondition"}

CheckNodeLabelPresence checks if all of the specified labels exist on a node, regardless of their value.

{"name" : "CheckNodeLabelPresence"}

checkServiceAffinity checks that ServiceAffinity labels are homogeneous for pods that are scheduled on a node.

{"name" : "checkServiceAffinity"}

PodToleratesNodeNoExecuteTaints checks if a pod tolerations can tolerate a node NoExecute taints.

{"name" : "PodToleratesNodeNoExecuteTaints"}

General Predicates

The following general predicates check whether non-critical predicates and essential predicates pass. Non-critical predicates are the predicates that only non-critical pods must pass and essential predicates are the predicates that all pods must pass.

The default scheduler policy includes the general predicates.

Non-critical general predicates

PodFitsResources determines a fit based on resource availability (CPU, memory, GPU, and so forth). The nodes can declare their resource capacities and then pods can specify what resources they require. Fit is based on requested, rather than used resources.

{"name" : "PodFitsResources"}
Essential general predicates

PodFitsHostPorts determines if a node has free ports for the requested pod ports (absence of port conflicts).

{"name" : "PodFitsHostPorts"}

HostName determines fit based on the presence of the Host parameter and a string match with the name of the host.

{"name" : "HostName"}

MatchNodeSelector determines fit based on node selector (nodeSelector) queries defined in the pod.

{"name" : "MatchNodeSelector"}

Understanding the scheduler priorities

Priorities are rules that rank nodes according to preferences.

A custom set of priorities can be specified to configure the scheduler. There are several priorities provided by default in OpenShift Container Platform. Other priorities can be customized by providing certain parameters. Multiple priorities can be combined and different weights can be given to each in order to impact the prioritization.

Static Priorities

Static priorities do not take any configuration parameters from the user, except weight. A weight is required to be specified and cannot be 0 or negative.

These are specified in the scheduler policy Configmap, policy-configmap in the openshift-config project.

Default Priorities

The default scheduler policy includes the following priorities. Each of the priority function has a weight of 1 except NodePreferAvoidPodsPriority, which has a weight of 10000.

NodeAffinityPriority prioritizes nodes according to node affinity scheduling preferences

{"name" : "NodeAffinityPriority", "weight" : 1}

TaintTolerationPriority prioritizes nodes that have a fewer number of intolerable taints on them for a pod. An intolerable taint is one which has key PreferNoSchedule.

{"name" : "TaintTolerationPriority", "weight" : 1}

ImageLocalityPriority prioritizes nodes that already have requested pod container’s images.

{"name" : "ImageLocalityPriority", "weight" : 1}

SelectorSpreadPriority looks for services, replication controllers (RC), replication sets (RS), and stateful sets that match the pod, then finds existing pods that match those selectors. The scheduler favors nodes that have fewer existing matching pods. Then, it schedules the pod on a node with the smallest number of pods that match those selectors as the pod being scheduled.

{"name" : "SelectorSpreadPriority", "weight" : 1}

InterPodAffinityPriority computes a sum by iterating through the elements of weightedPodAffinityTerm and adding weight to the sum if the corresponding PodAffinityTerm is satisfied for that node. The node(s) with the highest sum are the most preferred.

{"name" : "InterPodAffinityPriority", "weight" : 1}

LeastRequestedPriority favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity.

{"name" : "LeastRequestedPriority", "weight" : 1}

BalancedResourceAllocation favors nodes with balanced resource usage rate. It calculates the difference between the consumed CPU and memory as a fraction of capacity, and prioritizes the nodes based on how close the two metrics are to each other. This should always be used together with LeastRequestedPriority.

{"name" : "BalancedResourceAllocation", "weight" : 1}

NodePreferAvoidPodsPriority ignores pods that are owned by a controller other than a replication controller.

{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}
Other Static Priorities

OpenShift Container Platform also supports the following priorities:

EqualPriority gives an equal weight of 1 to all nodes, if no priority configurations are provided. We recommend using this priority only for testing environments.

{"name" : "EqualPriority", "weight" : 1}

MostRequestedPriority prioritizes nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.

{"name" : "MostRequestedPriority", "weight" : 1}

ServiceSpreadingPriority spreads pods by minimizing the number of pods belonging to the same service onto the same machine.

{"name" : "ServiceSpreadingPriority", "weight" : 1}

Configurable Priorities

You can configure these priorities in the scheduler policy Configmap, policy-configmap in the openshift-config project, to add labels to affect how the priorities.

The type of the priority function is identified by the argument that they take. Since these are configurable, multiple priorities of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.

For information on using these priorities, see Modifying Scheduler Policy.

ServiceAntiAffinity takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a higher score to nodes within a group with the least concentration of pods.

{
"kind": "Policy",
"apiVersion": "v1",

"priorities":[
    {
        "name":"<name>", (1)
        "weight" : 1 (2)
        "argument":{
            "serviceAntiAffinity":{
                "label": "<label>" (3)
                }
           }
       }
   ]
}
1 Specify a name for the priority.
2 Specify a weight. Enter a non-zero positive value.
3 Specify a label to match.

For example:

{
"kind": "Policy",
"apiVersion": "v1",
"priorities": [
    {
        "name":"RackSpread",
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {
                "label": "rack"
                }
           }
       }
   ]
}

In some situations using ServiceAntiAffinity based on custom labels does not spread pod as expected. See this Red Hat Solution.

*The labelPreference parameter gives priority based on the specified label. If the label is present on a node, that node is given priority. If no label is specified, priority is given to nodes that do not have a label.

{
"kind": "Policy",
"apiVersion": "v1",
"priorities":[
    {
        "name":"<name>", (1)
        "weight" : 1 (2)
        "argument":{
            "labelPreference":{
                "label": "<label>", (3)
                "presence": true (4)
                }
            }
        }
    ]
}
1 Specify a name for the priority.
2 Specify a weight. Enter a non-zero positive value.
3 Specify a label to match.
4 Specify whether the label is required, either true or false.

Sample Policy Configurations

The configuration below specifies the default scheduler configuration, if it were to be specified using the scheduler policy file.

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "RegionZoneAffinity", (1)
        "argument": {
            "serviceAffinity": {  (2)
              "labels": "region, zone"  (3)
           }
        }
     }
  ],
"priorities": [
    {
        "name":"RackSpread", (4)
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {  (5)
                "label": "rack"  (6)
                }
           }
       }
   ]
}
1 The name for the predicate.
2 The type of predicate.
3 The labels for the predicate.
4 The name for the priority.
5 The type of priority.
6 The labels for the priority.

In all of the sample configurations below, the list of predicates and priority functions is truncated to include only the ones that pertain to the use case specified. In practice, a complete/meaningful scheduler policy should include most, if not all, of the default predicates and priorities listed above.

The following example defines three topological levels, region (affinity) → zone (affinity) → rack (anti-affinity):

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "RegionZoneAffinity",
        "argument": {
            "serviceAffinity": {
              "label": "region, zone"
           }
        }
     }
  ],
"priorities": [
    {
        "name":"RackSpread",
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {
                "label": "rack"
                }
           }
       }
   ]
}

The following example defines three topological levels, city (affinity) → building (anti-affinity) → room (anti-affinity):

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "CityAffinity",
        "argument": {
            "serviceAffinity": {
              "label": "city"
           }
        }
     }
  ],
"priorities": [
    {
        "name":"BuildingSpread",
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {
                "label": "building"
                }
           }
       },
    {
        "name":"RoomSpread",
        "weight" : 1,
        "argument": {
            "serviceAntiAffinity": {
                "label": "room"
                }
           }
       }
   ]
}

The following example defines a policy to only use nodes with the 'region' label defined and prefer nodes with the 'zone' label defined:

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "RequireRegion",
        "argument": {
            "labelPreference": {
                "label": "region",
                "presence": true
           }
        }
     }
  ],
"priorities": [
    {
        "name":"ZonePreferred",
        "weight" : 1,
        "argument": {
            "labelPreference": {
                "label": "zone",
                "presence": true
                }
           }
       }
   ]
}

The following example combines both static and configurable predicates and also priorities:

{
"kind": "Policy",
"apiVersion": "v1",
"predicates": [
    {
        "name": "RegionAffinity",
        "argument": {
            "serviceAffinity": {
                "label": "region"
           }
        }
     },
    {
        "name": "RequireRegion",
        "argument": {
            "labelsPresence": {
                "label": "region",
                "presence": true
           }
        }
     },
    {
        "name": "BuildingNodesAvoid",
        "argument": {
            "labelsPresence": {
                "label": "building",
                "presence": false
           }
        }
     },
     {"name" : "PodFitsPorts"},
     {"name" : "MatchNodeSelector"}
     ],
"priorities": [
    {
        "name": "ZoneSpread",
        "weight" : 2,
        "argument": {
            "serviceAntiAffinity":{
                "label": "zone"
                }
           }
       },
    {
        "name":"ZonePreferred",
        "weight" : 1,
        "argument": {
            "labelPreference":{
                "label": "zone",
                "presence": true
                }
           }
       },
    {"name" : "ServiceSpreadingPriority", "weight" : 1}
    ]
}