Node labels and taints with elx-nodegroup-controller

Automatically apply labels and taints to groups of nodes using the nodegroup controller

The elx-nodegroup-controller lets you declaratively manage labels and taints across groups of nodes in your cluster. Instead of manually patching each node after it joins, you define a NodeGroup resource that describes which nodes to target and what labels and taints to apply. The controller keeps the nodes in sync automatically, even after node replacements due to auto-healing or scaling.

The controller is available as a managed add-on through Elastx and can also be deployed from the public GitHub repository if you prefer to manage it yourself.


When is this useful?

A few common scenarios:

  • GPU or specialised hardware nodes — taint dedicated nodes so only workloads that explicitly tolerate the taint are scheduled on them.
  • Cost or zone affinity — label nodes by nodegroup so workloads can use nodeAffinity to target specific node types.
  • Workload isolation — mark nodes reserved for databases, batch jobs, or frontend replicas using a combination of labels and taints.

Concepts

A NodeGroup resource targets nodes in two ways — you can use one or both in the same resource:

Field                 Behaviour
spec.members          Explicit list of node names. Exact match.
spec.nodeGroupNames   List of name segments. A node matches if any dash-separated part of its name equals one of the listed segments.
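
As a rough illustration of the matching rule for spec.nodeGroupNames described above, the fragment below annotates which node names a single segment would match. The node names in the comments are invented for this sketch, and the outcomes are inferred from the rule as worded here, not verified against the controller.

```yaml
# Fragment of a NodeGroup spec; only the matching field is shown.
spec:
  nodeGroupNames:
    - gpu
# Node names are split on "-" and each part is compared for equality
# (hypothetical node names, outcomes inferred from the stated rule):
#   worker-gpu-a   -> parts: worker, gpu, a    -> matches
#   sto1-gpu-1     -> parts: sto1, gpu, 1      -> matches
#   worker-gpus-a  -> parts: worker, gpus, a   -> no match ("gpus" is not "gpu")
#   gpuworker-1    -> parts: gpuworker, 1      -> no match (substring only)
```

If a short segment could collide with unrelated node names, spec.members offers exact-name matching instead.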

When a NodeGroup is deleted, the controller removes only the labels and taints it applied. Labels and taints that were on the node before the NodeGroup existed are left untouched.


Applying labels to a nodegroup

The following example labels all nodes whose name contains gpu as a dash-separated segment (e.g. worker-gpu-a, worker-gpu-b, sto1-gpu-1):

apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: gpu-workers
spec:
  nodeGroupNames:
    - gpu
  labels:
    elastx.cloud/node-type: gpu

Save the manifest as gpu-nodegroup.yaml and apply it:

kubectl apply -f gpu-nodegroup.yaml

Verify the label was applied:

kubectl get nodes -l elastx.cloud/node-type=gpu
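
Once the label is in place, workloads can select on it. The following is a minimal sketch of a Pod that uses standard Kubernetes nodeAffinity to require the elastx.cloud/node-type=gpu label. The Pod name, container name, and image are placeholders chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-affinity-demo                      # hypothetical name
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: the Pod stays Pending if no labelled node is available.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: elastx.cloud/node-type
                operator: In
                values:
                  - gpu
  containers:
    - name: app                                # placeholder container
      image: registry.example.com/gpu-app:1.0  # placeholder image
```

For a simple equality match like this one, a nodeSelector would work as well. nodeAffinity becomes useful when you need operators such as In or NotIn, or a soft preference.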

Adding a taint to restrict scheduling

Taints prevent workloads from being scheduled on a node unless they explicitly tolerate the taint. This example extends the gpu-workers NodeGroup with a NoSchedule taint so that only workloads that tolerate it can land on the gpu nodes:

apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: gpu-workers
spec:
  nodeGroupNames:
    - gpu
  labels:
    elastx.cloud/node-type: gpu
  taints:
    - key: elastx.cloud/gpu
      value: "true"
      effect: NoSchedule

A pod that should run on these nodes needs a matching toleration:

tolerations:
  - key: elastx.cloud/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
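
Note that a toleration only permits scheduling onto the tainted nodes; it does not steer the pod there. To pin a workload to the gpu nodes, the toleration is usually combined with a selector on the label. The Deployment below is a sketch under that assumption. Its name, app label, and image are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-job                                  # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-job
  template:
    metadata:
      labels:
        app: gpu-job
    spec:
      # Attracts the pods to nodes labelled by the NodeGroup.
      nodeSelector:
        elastx.cloud/node-type: gpu
      # Allows the pods past the NoSchedule taint on those nodes.
      tolerations:
        - key: elastx.cloud/gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: app                              # placeholder container
          image: registry.example.com/gpu-job:1.0  # placeholder image
```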

Targeting nodes by name

If you want to target specific nodes rather than relying on naming patterns, list them explicitly in spec.members:

apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: database-nodes
spec:
  members:
    - worker-sto1-db-1
    - worker-sto2-db-1
    - worker-sto3-db-1
  taints:
    - key: elastx.cloud/role
      value: database
      effect: NoSchedule
  labels:
    elastx.cloud/role: database

You can mix members and nodeGroupNames in the same resource — the sets are merged and deduplicated automatically.
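
A sketch of a mixed resource might look like the following. The resource name, node names, and label value are made up for illustration. In this sketch worker-sto1-batch-1 is matched both by name and by the batch segment, so by the deduplication described above it would be handled once.

```yaml
apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: batch-nodes              # hypothetical name
spec:
  members:
    - worker-sto1-batch-1        # also matches the "batch" segment below
    - worker-sto2-misc-1         # hypothetical node matched only by name
  nodeGroupNames:
    - batch
  labels:
    elastx.cloud/role: batch
```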


Good to know

  • Reserved label prefixes — the controller will reject a NodeGroup that tries to set labels with the prefixes kubernetes.io/, k8s.io/, node.kubernetes.io/, or node-role.kubernetes.io/. These are reserved for Kubernetes itself.
  • Taint effects — valid values are NoSchedule, PreferNoSchedule, and NoExecute.
  • Limits — a single NodeGroup supports up to 500 explicit members, 50 name segments, 64 labels, and 100 taints.
  • Cleanup on delete — deleting a NodeGroup triggers the controller to remove the labels and taints it applied before the resource is fully removed. The controller uses a finalizer to ensure this happens even if the deletion races with a node replacement.
  • Node replacements — when auto-healing replaces a node, the new node receives the correct labels and taints on its first reconciliation. No manual intervention is needed.
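
To make the notes on taint effects and reserved prefixes concrete, here is an illustrative NodeGroup that uses the soft PreferNoSchedule effect. The resource name, segment, and label values are assumptions for this sketch. The commented-out label shows the kind of key that, per the list above, the controller would reject.

```yaml
apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: frontend-preferred       # hypothetical name
spec:
  nodeGroupNames:
    - frontend
  labels:
    elastx.cloud/role: frontend
    # kubernetes.io/role: frontend   # reserved prefix, would be rejected
  taints:
    - key: elastx.cloud/role
      value: frontend
      # PreferNoSchedule: the scheduler avoids these nodes when it can.
      # NoSchedule:       new pods without a toleration are not scheduled.
      # NoExecute:        additionally evicts running pods without a toleration.
      effect: PreferNoSchedule
```
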
Last modified June 2, 2026: Changed to beta (#294) (4c9798d)