Node labels and taints with elx-nodegroup-controller
The elx-nodegroup-controller lets you declaratively manage labels and taints
across groups of nodes in your cluster. Instead of manually patching each node
after it joins, you define a NodeGroup resource that describes which nodes to
target and what labels and taints to apply. The controller keeps the nodes
in sync automatically, even after node replacements due to auto-healing or
scaling.
The controller is available as a managed add-on through Elastx and can also be deployed from the public GitHub repository if you prefer to manage it yourself.
When is this useful?
A few common scenarios:
- GPU or specialised hardware nodes: taint dedicated nodes so only workloads that explicitly tolerate the taint are scheduled on them.
- Cost or zone affinity: label nodes by nodegroup so workloads can use nodeAffinity to target specific node types.
- Workload isolation: mark nodes reserved for databases, batch jobs, or frontend replicas using a combination of labels and taints.
Concepts
A NodeGroup resource targets nodes in two ways — you can use one or both in
the same resource:
| Field | Behaviour |
|---|---|
| spec.members | Explicit list of node names. Exact match. |
| spec.nodeGroupNames | List of name segments. A node matches if any dash-separated part of its name equals one of the listed segments. |
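The segment rule can be previewed locally before creating a resource. The following is a minimal sketch of the matching rule as described in the table, not the controller's actual code; the function name `node_matches_segment` and the node name `worker-gpux-1` are hypothetical, and case-sensitive comparison is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the controller.
# Returns 0 if any dash-separated part of the node name ($1)
# equals one of the segments given as the remaining arguments.
node_matches_segment() {
  local node_name=$1
  shift
  local -a parts
  # Split the node name on dashes: worker-gpu-a -> worker, gpu, a
  IFS='-' read -r -a parts <<< "$node_name"
  local part segment
  for part in "${parts[@]}"; do
    for segment in "$@"; do
      # Quoted right-hand side forces a literal, whole-part comparison.
      [[ "$part" == "$segment" ]] && return 0
    done
  done
  return 1
}

# Try the rule against a few sample node names and the segment "gpu".
for name in worker-gpu-a sto1-gpu-1 worker-gpux-1; do
  if node_matches_segment "$name" gpu; then
    echo "$name: match"
  else
    echo "$name: no match"
  fi
done
```

Because the comparison is on whole parts, a name such as worker-gpux-1 is not selected by the segment gpu, even though it contains those letters.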
When a NodeGroup is deleted the controller removes only the labels and taints
it applied. Labels and taints that were on the node before the NodeGroup
existed are left untouched.
Applying labels to a nodegroup
The following example labels all nodes whose name contains the segment gpu
(e.g. worker-gpu-a, worker-gpu-b, sto1-gpu-1):
apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: gpu-workers
spec:
  nodeGroupNames:
    - gpu
  labels:
    elastx.cloud/node-type: gpu
Apply it:
kubectl apply -f gpu-nodegroup.yaml
Verify the label was applied:
kubectl get nodes -l elastx.cloud/node-type=gpu
Adding a taint to restrict scheduling
Taints prevent workloads from being scheduled on a node unless they explicitly
tolerate the taint. This example taints the same gpu nodes with NoSchedule
so that only GPU-aware workloads land on them:
apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: gpu-workers
spec:
  nodeGroupNames:
    - gpu
  labels:
    elastx.cloud/node-type: gpu
  taints:
    - key: elastx.cloud/gpu
      value: "true"
      effect: NoSchedule
A pod that should run on these nodes needs a matching toleration:
tolerations:
  - key: elastx.cloud/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
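A toleration only permits scheduling on the tainted nodes; it does not steer the pod towards them. To make a workload actually land on the GPU nodes, the toleration is usually paired with a selector on the label from the NodeGroup. The sketch below writes such a Pod manifest to a file. The pod name, container name, image, and file name are hypothetical placeholders; the label and taint keys are the ones used in the examples above.

```shell
# Write a Pod manifest that tolerates the GPU taint and selects the GPU label.
# Pod name, container name, and image are placeholders for illustration only.
cat > gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  # Only consider nodes carrying the label applied by the NodeGroup.
  nodeSelector:
    elastx.cloud/node-type: gpu
  # Allow scheduling despite the NoSchedule taint on those nodes.
  tolerations:
    - key: elastx.cloud/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: app
      image: registry.example.com/gpu-app:latest
EOF
```

The resulting file can be applied with kubectl apply -f gpu-pod.yaml.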
Targeting nodes by name
If you want to target specific nodes rather than relying on naming patterns,
list them explicitly in spec.members:
apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: database-nodes
spec:
  members:
    - worker-sto1-db-1
    - worker-sto2-db-1
    - worker-sto3-db-1
  taints:
    - key: elastx.cloud/role
      value: database
      effect: NoSchedule
  labels:
    elastx.cloud/role: database
You can mix members and nodeGroupNames in the same resource — the sets are
merged and deduplicated automatically.
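As an illustration of a mixed resource, the sketch below writes a manifest that uses both fields. The resource name, node name, segment, label value, and file name are all hypothetical. The listed member also contains the segment batch, so it is selected by both rules and, per the merge behaviour described above, is targeted once.

```shell
# Write a NodeGroup manifest that combines members and nodeGroupNames.
# All names and values below are illustrative placeholders.
cat > mixed-nodegroup.yaml <<'EOF'
apiVersion: k8s.elx.cloud/v1alpha2
kind: NodeGroup
metadata:
  name: batch-nodes
spec:
  # Explicit node, matched by exact name.
  members:
    - worker-sto1-batch-extra
  # Any node with "batch" as a dash-separated part of its name.
  nodeGroupNames:
    - batch
  labels:
    elastx.cloud/role: batch
EOF
```

The file can then be applied with kubectl apply -f mixed-nodegroup.yaml, in the same way as the earlier examples.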
Good to know
- Reserved label prefixes: the controller will reject a NodeGroup that tries to set labels with the prefixes kubernetes.io/, k8s.io/, node.kubernetes.io/, or node-role.kubernetes.io/. These are reserved for Kubernetes itself.
- Taint effects: valid values are NoSchedule, PreferNoSchedule, and NoExecute.
- Limits: a single NodeGroup supports up to 500 explicit members, 50 name segments, 64 labels, and 100 taints.
- Cleanup on delete: deleting a NodeGroup triggers the controller to remove the labels and taints it applied before the resource is fully removed. The controller uses a finalizer to ensure this happens even if the deletion races with a node replacement.
- Node replacements: when auto-healing replaces a node, the new node will receive the correct labels and taints on its first reconciliation, with no manual intervention needed.
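The reserved-prefix and taint-effect rules above can be checked locally before a manifest is applied. This is a minimal sketch, not the controller's validation code: the function names are hypothetical, and treating the prefixes as literal "starts with" matches on the label key is an assumption about how the controller compares them.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers, not part of the controller.

# Returns 0 if the label key starts with one of the reserved prefixes.
is_reserved_label_key() {
  local key=$1 prefix
  for prefix in kubernetes.io/ k8s.io/ node.kubernetes.io/ node-role.kubernetes.io/; do
    # Quoted prefix is compared literally; the trailing * allows any suffix.
    [[ "$key" == "$prefix"* ]] && return 0
  done
  return 1
}

# Returns 0 if the value is one of the three valid taint effects.
is_valid_taint_effect() {
  case $1 in
    NoSchedule|PreferNoSchedule|NoExecute) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: check a label key and a taint effect from the manifests above.
if is_reserved_label_key "elastx.cloud/node-type"; then
  echo "label key uses a reserved prefix"
else
  echo "label key is allowed"
fi
if is_valid_taint_effect "NoSchedule"; then
  echo "taint effect is valid"
else
  echo "taint effect is invalid"
fi
```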