Module 1.7: Customizing the Scheduler
Complexity: COMPLEX (Extending Kubernetes scheduling decisions)
Time to Complete: 4 hours
Prerequisites: Module 1.1 (API Deep Dive), understanding of Pod scheduling basics
What You’ll Be Able to Do
After completing this module, you will be able to:
- Implement a custom Filter plugin that excludes nodes based on real-time conditions (GPU utilization, rack topology, compliance labels)
- Implement a custom Score plugin that ranks nodes using business-specific criteria like data locality or cost optimization
- Deploy a scheduler plugin alongside the default scheduler using KubeSchedulerConfiguration profiles
- Evaluate when to use scheduler plugins vs. scheduling constraints (affinity, taints) for a given workload placement requirement
Why This Module Matters
The default Kubernetes scheduler is remarkably capable — it handles affinity, anti-affinity, resource requests, taints, tolerations, topology constraints, and more. But there are scheduling decisions it cannot make out of the box. What if you want to schedule Pods based on real-time GPU utilization instead of static resource requests? What if you need to colocate certain workloads on the same rack for network latency? What if your compliance requirements demand scheduling based on data residency labels?
The Scheduling Framework is the answer. Since Kubernetes 1.19, the scheduler is built as a plugin system with well-defined extension points. You can write plugins in Go that hook into any stage of the scheduling pipeline — filtering nodes, scoring them, binding Pods, or even preempting lower-priority workloads. You can also run multiple schedulers side by side, each with its own personality.
The Restaurant Seating Analogy
The default scheduler is like a restaurant host who seats guests based on table availability and party size. A custom Filter plugin adds rules: “this party requested a window seat” (node affinity). A custom Score plugin adds preferences: “seat regulars closer to the kitchen for faster service” (colocation scoring). A custom Bind plugin changes the reservation process: “hold this table for 5 minutes while we confirm the reservation” (custom binding). The host still does the work; your plugins influence the decisions.
What You’ll Learn
By the end of this module, you will be able to:
- Understand the complete scheduling framework and its extension points
- Write custom Filter and Score plugins in Go
- Configure KubeSchedulerConfiguration for custom plugins
- Deploy a secondary scheduler alongside the default one
- Debug scheduling decisions using events and logs
Did You Know?
- The default scheduler does not score every node in a large cluster. It scores only a percentage of feasible nodes (configurable via `percentageOfNodesToScore`, with a floor of 100 nodes). For a 5,000-node cluster, it does not check every node — it samples and picks the best from the sample. This is why scheduling is fast even at massive scale.
- Kubernetes supports running multiple schedulers simultaneously: you can have the default scheduler for most workloads and a custom scheduler for GPU workloads, each with different plugins enabled. Pods declare which scheduler to use via `spec.schedulerName`.
- The Scheduling Framework has been the recommended extension mechanism since Kubernetes 1.19. Scheduler extenders (webhook-based) and standalone custom schedulers are still supported, but the Framework offers better performance and deeper integration. Scheduler policy files are deprecated.
Part 1: The Scheduling Framework
1.1 Scheduling Cycle Overview
```mermaid
flowchart TD
    Start([Pod enters scheduling queue]) --> PreEnqueue

    subgraph Queueing Phase
        PreEnqueue[1. PreEnqueue<br/>Reject pods before queuing] --> Sort[2. Sort<br/>Order pods in the queue]
    end

    Sort --> PreFilter

    subgraph Scheduling Cycle
        PreFilter[3. PreFilter<br/>Compute shared state] --> Filter[4. Filter<br/>Eliminate infeasible nodes]
        Filter -- If no nodes fit --> PostFilter[5. PostFilter<br/>Handle preemption]
        Filter -- If nodes pass --> PreScore[6. PreScore<br/>Compute shared score state]
        PostFilter -. Retry scheduling .-> PreFilter
        PreScore --> Score[7. Score<br/>Rank feasible nodes 0-100]
        Score --> Normalize[8. NormalizeScore<br/>Normalize scores]
        Normalize --> Reserve[9. Reserve<br/>Optimistically assume placement]
        Reserve --> Permit[10. Permit<br/>Hold, allow, or deny binding]
    end

    Permit --> PreBind

    subgraph Binding Cycle
        PreBind[11. PreBind<br/>Pre-binding operations] --> Bind[12. Bind<br/>Actually bind Pod to Node]
        Bind --> PostBind[13. PostBind<br/>Informational, after binding]
    end
```

1.2 Extension Points Reference
| Extension Point | When It Runs | What It Does | Return Type |
|---|---|---|---|
| PreEnqueue | Before queuing | Gate pods from entering queue | Allow/Reject |
| Sort | Queue ordering | Prioritize pods in queue | Less function |
| PreFilter | Once per cycle | Compute shared filter state | Status |
| Filter | Per node | Eliminate infeasible nodes | Status (pass/fail) |
| PostFilter | After no node fits | Try preemption | Status + nominated node |
| PreScore | Once per cycle | Compute shared score state | Status |
| Score | Per node | Rank nodes 0-100 | Score + Status |
| NormalizeScore | After all scores | Normalize to [0,100] | Status |
| Reserve | After node selected | Optimistic reservation | Status |
| Permit | Before binding | Approve/deny/wait | Status + wait time |
| PreBind | Before actual bind | Pre-binding actions | Status |
| Bind | Binding | Bind pod to node | Status |
| PostBind | After binding | Cleanup, notifications | void |
1.3 Built-in Plugins
The default scheduler already uses these plugins:
| Plugin | Extension Points | What It Does |
|---|---|---|
| NodeResourcesFit | PreFilter, Filter | Check CPU/memory availability |
| NodePorts | PreFilter, Filter | Check port availability |
| NodeAffinity | Filter, Score | Node affinity/anti-affinity rules |
| PodTopologySpread | PreFilter, Filter, PreScore, Score | Topology spread constraints |
| TaintToleration | Filter, PreScore, Score | Taint/toleration matching |
| InterPodAffinity | PreFilter, Filter, PreScore, Score | Pod affinity/anti-affinity |
| VolumeBinding | PreFilter, Filter, Reserve, PreBind | PV/PVC binding |
| DefaultPreemption | PostFilter | Preempt lower-priority pods |
| ImageLocality | Score | Prefer nodes with cached images |
| BalancedAllocation | Score | Balance resource usage across nodes |
Part 2: Writing a Custom Score Plugin
2.1 Project Structure
```
scheduler-plugins/
├── go.mod
├── go.sum
├── cmd/
│   └── scheduler/
│       └── main.go                  # Entry point
├── pkg/
│   └── plugins/
│       └── nodepreference/
│           ├── nodepreference.go    # Plugin implementation
│           └── nodepreference_test.go
└── manifests/
    ├── scheduler-config.yaml        # KubeSchedulerConfiguration
    └── deployment.yaml              # Secondary scheduler deployment
```

2.2 The Score Plugin
This plugin scores nodes based on a custom label. Nodes labeled `scheduling.kubedojo.io/tier: premium` get a higher score than standard or unlabeled nodes:
```go
package nodepreference

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Name is the name of the plugin.
	Name = "NodePreference"

	// LabelKey is the node label key used for scoring.
	LabelKey = "scheduling.kubedojo.io/tier"
)

// NodePreference scores nodes based on a tier label.
type NodePreference struct {
	handle framework.Handle
	args   NodePreferenceArgs
}

// NodePreferenceArgs are the arguments for the plugin.
type NodePreferenceArgs struct {
	metav1.TypeMeta `json:",inline"`

	// TierScores maps tier label values to scores (0-100).
	TierScores map[string]int64 `json:"tierScores"`

	// DefaultScore is the score for nodes without the tier label.
	DefaultScore int64 `json:"defaultScore"`
}

var _ framework.ScorePlugin = &NodePreference{}
var _ framework.EnqueueExtensions = &NodePreference{}

// Name returns the name of the plugin.
func (pl *NodePreference) Name() string {
	return Name
}

// Score scores a node based on its tier label.
func (pl *NodePreference) Score(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeName string,
) (int64, *framework.Status) {
	// Get the node info from the scheduler's in-memory snapshot
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(fmt.Errorf("getting node %q: %w", nodeName, err))
	}

	node := nodeInfo.Node()

	// Check for the tier label
	tierValue, exists := node.Labels[LabelKey]
	if !exists {
		return pl.args.DefaultScore, nil
	}

	// Look up the score for this tier
	score, found := pl.args.TierScores[tierValue]
	if !found {
		return pl.args.DefaultScore, nil
	}

	return score, nil
}

// ScoreExtensions returns the score extension functions.
func (pl *NodePreference) ScoreExtensions() framework.ScoreExtensions {
	return pl
}

// NormalizeScore normalizes the scores to [0, MaxNodeScore].
func (pl *NodePreference) NormalizeScore(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	scores framework.NodeScoreList,
) *framework.Status {
	// Find the max raw score
	var maxScore int64
	for i := range scores {
		if scores[i].Score > maxScore {
			maxScore = scores[i].Score
		}
	}

	if maxScore == 0 {
		return nil
	}

	// Normalize to [0, 100]
	for i := range scores {
		scores[i].Score = (scores[i].Score * framework.MaxNodeScore) / maxScore
	}

	return nil
}

// EventsToRegister returns the events that trigger rescheduling.
func (pl *NodePreference) EventsToRegister() []framework.ClusterEventWithHint {
	return []framework.ClusterEventWithHint{
		{ClusterEvent: framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.Update}},
	}
}

// New creates a new NodePreference plugin.
func New(ctx context.Context, obj runtime.Object, handle framework.Handle) (framework.Plugin, error) {
	args, ok := obj.(*NodePreferenceArgs)
	if !ok {
		return nil, fmt.Errorf("want args to be of type NodePreferenceArgs, got %T", obj)
	}

	return &NodePreference{
		handle: handle,
		args:   *args,
	}, nil
}
```

Stop and think: In a cluster with thousands of nodes and hundreds of pods being scheduled per second, reading labels directly from the API server inside the `Score` function would cause massive latency. How does the Scheduling Framework prevent this API server bottleneck when you call `pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)`?
2.3 Writing a Filter Plugin
A filter plugin eliminates nodes that do not meet certain criteria:
```go
package gpufilter

import (
	"context"
	"fmt"
	"strconv"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	Name             = "GPUFilter"
	GPUCountLabel    = "gpu.kubedojo.io/count"
	GPUTypeLabel     = "gpu.kubedojo.io/type"
	PodGPUAnnotation = "scheduling.kubedojo.io/gpu-type"
)

type GPUFilter struct {
	handle framework.Handle
}

var _ framework.FilterPlugin = &GPUFilter{}
var _ framework.PreFilterPlugin = &GPUFilter{}

func (pl *GPUFilter) Name() string {
	return Name
}

// preFilterState is the shared state computed once per scheduling cycle.
type preFilterState struct {
	requiredGPUType string
	needsGPU        bool
}

func (s *preFilterState) Clone() framework.StateData {
	return &preFilterState{
		requiredGPUType: s.requiredGPUType,
		needsGPU:        s.needsGPU,
	}
}

const preFilterStateKey = "PreFilter" + Name

// PreFilter checks if the pod needs GPU scheduling at all.
func (pl *GPUFilter) PreFilter(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
) (*framework.PreFilterResult, *framework.Status) {
	gpuType := pod.Annotations[PodGPUAnnotation]
	pfs := &preFilterState{
		requiredGPUType: gpuType,
		needsGPU:        gpuType != "",
	}

	state.Write(preFilterStateKey, pfs)

	if !pfs.needsGPU {
		// Skip the filter entirely — this pod doesn't need GPU
		return nil, framework.NewStatus(framework.Skip)
	}

	return nil, nil
}

func (pl *GPUFilter) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}

// Filter checks if a node has the required GPU type and available GPUs.
func (pl *GPUFilter) Filter(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeInfo *framework.NodeInfo,
) *framework.Status {
	// Read pre-filter state
	data, err := state.Read(preFilterStateKey)
	if err != nil {
		return framework.AsStatus(fmt.Errorf("reading pre-filter state: %w", err))
	}
	pfs := data.(*preFilterState)

	if !pfs.needsGPU {
		return nil // Should not reach here due to Skip, but be safe
	}

	node := nodeInfo.Node()

	// Check GPU type
	nodeGPUType, exists := node.Labels[GPUTypeLabel]
	if !exists {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("node %s has no GPU type label", node.Name))
	}

	if nodeGPUType != pfs.requiredGPUType {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("node has GPU type %q, pod requires %q", nodeGPUType, pfs.requiredGPUType))
	}

	// Check GPU count
	gpuCountStr, exists := node.Labels[GPUCountLabel]
	if !exists {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("node %s has no GPU count label", node.Name))
	}

	gpuCount, err := strconv.Atoi(gpuCountStr)
	if err != nil || gpuCount <= 0 {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("node %s has invalid GPU count: %s", node.Name, gpuCountStr))
	}

	return nil
}

func New(ctx context.Context, obj runtime.Object, handle framework.Handle) (framework.Plugin, error) {
	return &GPUFilter{handle: handle}, nil
}
```

Part 3: Building and Registering Plugins
3.1 The Main Entry Point
```go
package main

import (
	"os"

	"k8s.io/component-base/cli"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"github.com/kubedojo/scheduler-plugins/pkg/plugins/gpufilter"
	"github.com/kubedojo/scheduler-plugins/pkg/plugins/nodepreference"
)

func main() {
	command := app.NewSchedulerCommand(
		app.WithPlugin(nodepreference.Name, nodepreference.New),
		app.WithPlugin(gpufilter.Name, gpufilter.New),
	)

	code := cli.Run(command)
	os.Exit(code)
}
```

3.2 Building the Scheduler Binary
```bash
# Initialize Go module
cd ~/extending-k8s/scheduler-plugins
go mod init github.com/kubedojo/scheduler-plugins

# Important: Pin to the same Kubernetes version as your cluster
K8S_VERSION=v1.32.0
go get k8s.io/kubernetes@$K8S_VERSION
go get k8s.io/component-base@$K8S_VERSION

go mod tidy
go build -o custom-scheduler ./cmd/scheduler/
```

3.3 Containerize
```dockerfile
# Dockerfile
FROM golang:1.23 AS builder
WORKDIR /workspace
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o custom-scheduler ./cmd/scheduler/

FROM gcr.io/distroless/static:nonroot
COPY --from=builder /workspace/custom-scheduler /custom-scheduler
USER 65532:65532
ENTRYPOINT ["/custom-scheduler"]
```

```bash
docker build -t custom-scheduler:v0.1.0 .
kind load docker-image custom-scheduler:v0.1.0 --name scheduler-lab
```

Part 4: KubeSchedulerConfiguration
4.1 Configuring the Secondary Scheduler
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  resourceNamespace: kube-system
  resourceName: custom-scheduler
profiles:
- schedulerName: custom-scheduler   # Pods reference this name
  plugins:
    # Enable our custom plugins
    filter:
      enabled:
      - name: GPUFilter
    score:
      enabled:
      - name: NodePreference
        weight: 25                  # Weight relative to other score plugins
    # Disable built-in plugins we're replacing
    # (usually you keep them all and just add yours)
  pluginConfig:
  - name: NodePreference
    args:
      tierScores:
        premium: 100
        standard: 50
        burstable: 20
      defaultScore: 10
```

4.2 Deploying the Secondary Scheduler
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
  labels:
    component: custom-scheduler
spec:
  replicas: 2   # HA with leader election
  selector:
    matchLabels:
      component: custom-scheduler
  template:
    metadata:
      labels:
        component: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler
      containers:
      - name: scheduler
        image: custom-scheduler:v0.1.0
        command:
        - /custom-scheduler
        - --config=/etc/scheduler/config.yaml
        - --v=2
        volumeMounts:
        - name: config
          mountPath: /etc/scheduler
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
          initialDelaySeconds: 15
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10259
            scheme: HTTPS
      volumes:
      - name: config
        configMap:
          name: custom-scheduler-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      resourceNamespace: kube-system
      resourceName: custom-scheduler
    profiles:
    - schedulerName: custom-scheduler
      plugins:
        score:
          enabled:
          - name: NodePreference
            weight: 25
      pluginConfig:
      - name: NodePreference
        args:
          tierScores:
            premium: 100
            standard: 50
            burstable: 20
          defaultScore: 10
```

4.3 RBAC for the Custom Scheduler
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-scheduler
rules:
- apiGroups: [""]
  resources: ["pods", "nodes", "namespaces", "configmaps", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/binding", "pods/status"]
  verbs: ["create", "update", "patch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
- apiGroups: ["coordination.k8s.io"]
  resources: ["leases"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: ["apps"]
  resources: ["replicasets", "statefulsets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
  resources: ["poddisruptionbudgets"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
  resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: custom-scheduler
subjects:
- kind: ServiceAccount
  name: custom-scheduler
  namespace: kube-system
```

Part 5: Using the Custom Scheduler
5.1 Pods Requesting the Custom Scheduler
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  annotations:
    scheduling.kubedojo.io/gpu-type: "a100"
spec:
  schedulerName: custom-scheduler   # Use our custom scheduler
  containers:
  - name: training
    image: nvidia/cuda:12.0-base
    resources:
      limits:
        nvidia.com/gpu: 1
```

5.2 Multiple Scheduler Profiles
A single scheduler binary can serve multiple profiles:
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-scheduler
  plugins:
    filter:
      enabled:
      - name: GPUFilter
    score:
      enabled:
      - name: NodePreference
        weight: 50

- schedulerName: low-latency-scheduler
  plugins:
    score:
      enabled:
      - name: NodePreference
        weight: 80
      disabled:
      - name: ImageLocality   # Disable image locality for latency workloads
  pluginConfig:
  - name: NodePreference
    args:
      tierScores:
        edge: 100
        regional: 60
      defaultScore: 0
```

5.3 Debugging Scheduling Decisions
```bash
# Check scheduler events for a pod
k describe pod gpu-workload | grep -A 15 "Events:"

# Look for scheduling failures
k get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp

# View scheduler logs
k logs -n kube-system -l component=custom-scheduler -f --tail=100

# Check if the custom scheduler is registered
k get pods -n kube-system -l component=custom-scheduler

# Verify a pod is using the custom scheduler
k get pod gpu-workload -o jsonpath='{.spec.schedulerName}'
```

Part 6: Advanced Topics
6.1 Scheduler Profiles vs Multiple Schedulers
| Approach | Pros | Cons |
|---|---|---|
| Multiple profiles (one binary) | Shared cache, single deployment | Same plugins available for all profiles |
| Multiple schedulers (separate binaries) | Complete isolation, different plugins | Higher resource usage, separate caches |
6.2 Plugin Weights
When multiple Score plugins run, their results are combined:
```
final_score(node) = SUM(plugin_score(node) * plugin_weight) / SUM(plugin_weights)
```

```yaml
plugins:
  score:
    enabled:
    - name: NodeResourcesFit
      weight: 1    # Default
    - name: NodePreference
      weight: 25   # 25x more important than default
    - name: InterPodAffinity
      weight: 2    # 2x default
```

6.3 Preemption
When no node can fit a pod, PostFilter plugins try preemption:
```
High-priority Pod cannot be scheduled
        │
        ▼
PostFilter: DefaultPreemption
        │
        ├── Find nodes where evicting lower-priority Pods would make room
        │
        ├── Select victim Pods (prefer lowest priority, fewest evictions)
        │
        ├── Set pod.Status.NominatedNodeName
        │
        └── Evict victim Pods → retry scheduling
```

Pause and predict: If a custom `PostFilter` plugin successfully preempts lower-priority pods to make room for a critical workload, does the `PostFilter` plugin directly bind the pending pod to the newly freed node? What must happen next?
Custom PostFilter plugins can implement alternative preemption strategies.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Not pinning Kubernetes version | Build breaks with dependency conflicts | Pin go.mod to exact cluster K8s version |
| Forgetting RBAC for the scheduler | Scheduler cannot read nodes/pods | Apply comprehensive ClusterRole |
| Score plugin returning > 100 | Panic or wrong normalization | Always return 0-100, use NormalizeScore |
| Filter plugin blocking all nodes | Pod stuck in Pending forever | Add fallback or make filter optional |
| No leader election on multi-replica | Duplicate scheduling decisions | Enable leader election in config |
| Wrong `schedulerName` in pod spec | Pod uses default scheduler, not custom | Verify the name matches the profile name exactly |
| Slow Score plugin | Scheduling latency spikes | Keep scoring O(1) per node, precompute in PreScore |
| Not handling missing labels gracefully | Panics or nil pointer errors | Always check label existence before using |
| Forgetting to register plugin in main.go | Plugin silently not loaded | Use app.WithPlugin() in the scheduler command |
- You are developing a plugin to ensure machine learning workloads only land on nodes with specific compliance certifications. Another plugin is needed to distribute these workloads across multiple availability zones to minimize blast radius. Which plugin types should you use for each requirement, and why?

  Answer: You should use a Filter plugin for the compliance certifications and a Score plugin for the availability zone distribution. The compliance requirement is a hard constraint; if a node lacks the certification, it must be completely eliminated from consideration, which is exactly what a Filter plugin does by returning a pass/fail status. The availability zone distribution is a soft preference; all certified nodes are technically valid, but you want to rank nodes in underutilized zones higher using a Score plugin. The scheduler will then evaluate the surviving nodes and place the Pod on the one with the highest normalized score.
- Your cluster administrator has deployed a new `gpu-scheduler` alongside the `default-scheduler`. You submit a Deployment with a pod template that requires a GPU, but you forget to add any scheduler-specific fields to the manifest. The `gpu-scheduler` is perfectly configured to handle this workload. What will happen to your Pods?

  Answer: Your Pods will be processed by the `default-scheduler` instead of the `gpu-scheduler` and may remain in a Pending state if the default scheduler lacks the necessary logic to place them. By default, if the `spec.schedulerName` field is omitted from a Pod's specification, the Kubernetes API server implicitly assigns it to the `default-scheduler`. The secondary `gpu-scheduler` operates completely independently and only watches for Pods explicitly requesting its exact profile name. To fix this, you must update the pod template to include `schedulerName: gpu-scheduler` so the default scheduler ignores it and the custom scheduler picks it up.
- You are writing a Filter plugin that needs to perform a complex, computationally expensive calculation based on a Pod's annotations to determine compatibility. If you perform this calculation inside the `Filter` extension point on a 5,000-node cluster, you notice significant scheduling delays. How can you redesign your plugin to resolve this performance bottleneck?

  Answer: You should move the computationally expensive calculation into the `PreFilter` extension point and store the result in the `CycleState`. The `Filter` extension point is invoked individually for every single feasible node in the cluster, meaning your calculation was being executed up to 5,000 times per Pod. The `PreFilter` extension point, however, is invoked exactly once per scheduling cycle before any node filtering begins. By computing the value once in `PreFilter` and reading that shared state inside the `Filter` function, you reduce the time complexity significantly and eliminate the scheduling delays.
- Your team creates a custom Score plugin that assigns a score of 500 to nodes with high network bandwidth and 10 to nodes with low bandwidth. However, during testing, you notice that the default scheduler plugins (like `NodeResourcesFit`) are completely ignoring your scores, and workloads are not being placed optimally. What architectural requirement of the Scheduling Framework did your team violate?

  Answer: Your team violated the requirement that all final scores must be normalized to a standard range, specifically between 0 and 100 (`framework.MaxNodeScore`). When a plugin returns raw scores outside this boundary, it must implement the `NormalizeScore` extension point to mathematically map its internal scoring system down to the 0-100 scale. Because your plugin returned a raw score of 500 without normalizing it, the framework either rejected the score or improperly weighted it against built-in plugins that correctly operate within the 0-100 range. Implementing the `NormalizeScore` function to scale 500 down to 100 will fix the aggregation issue.
- You deploy a mission-critical Pod with `schedulerName: fast-scheduler`, but the `fast-scheduler` deployment has crashed and currently has zero running replicas. The `default-scheduler` is perfectly healthy and capable of placing the Pod. How will the cluster handle this failure scenario to ensure the Pod gets scheduled?

  Answer: The cluster will not schedule the Pod at all, and it will remain in a Pending state indefinitely until the `fast-scheduler` is restored. Kubernetes does not have a fallback mechanism for scheduler assignment; the `spec.schedulerName` is a strict, exclusive contract. The `default-scheduler` explicitly filters out any Pods that do not match its own name, meaning it will completely ignore your mission-critical Pod. This architectural design prevents race conditions and conflicts that would occur if multiple schedulers attempted to bind the same Pod simultaneously.
- You have configured a custom KubeSchedulerConfiguration profile with your `NodePreference` plugin and the built-in `InterPodAffinity` plugin. Your plugin correctly scores premium nodes at 100, but Pods are still consistently landing on standard nodes (score 50) because they already contain other Pods from the same application. How can you configure the scheduler to prioritize your custom tier preference over the built-in pod affinity?

  Answer: You need to adjust the `weight` field for your `NodePreference` plugin within the `KubeSchedulerConfiguration` profile to be significantly higher than the weight of the `InterPodAffinity` plugin. The final node score is a weighted sum of all active Score plugins, meaning a plugin with a higher weight has a proportionally larger mathematical impact on the final decision. For example, if your plugin has a weight of 1 and `InterPodAffinity` has a weight of 5, the built-in affinity can easily override your tier preference. Increasing your plugin's weight to 10 or 20 will ensure the framework mathematically favors premium nodes even when pod affinity suggests otherwise.
Hands-On Exercise
Task: Build a custom Score plugin that prefers nodes with a specific tier label, configure it via KubeSchedulerConfiguration, deploy it as a secondary scheduler, and verify scheduling decisions.
Setup:
```bash
kind create cluster --name scheduler-lab --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
```

Steps:
- Label the nodes with tiers:
```bash
# Get worker node names
NODES=$(k get nodes --no-headers -o custom-columns=':metadata.name' | grep -v control-plane)

# Label them
NODE1=$(echo "$NODES" | sed -n '1p')
NODE2=$(echo "$NODES" | sed -n '2p')
NODE3=$(echo "$NODES" | sed -n '3p')

k label node "$NODE1" scheduling.kubedojo.io/tier=premium
k label node "$NODE2" scheduling.kubedojo.io/tier=standard
k label node "$NODE3" scheduling.kubedojo.io/tier=burstable

# Verify labels
k get nodes --show-labels | grep kubedojo
```
- Create the Go project from the code in Parts 2 and 3

- Build and load the scheduler image:
```bash
docker build -t custom-scheduler:v0.1.0 .
kind load docker-image custom-scheduler:v0.1.0 --name scheduler-lab
```
- Deploy RBAC, ConfigMap, and Deployment from Part 4

- Verify the custom scheduler is running:
```bash
k get pods -n kube-system -l component=custom-scheduler
k logs -n kube-system -l component=custom-scheduler --tail=20
```

- Create test Pods using the custom scheduler:
```bash
# Create 5 pods with the custom scheduler
for i in $(seq 1 5); do
  k run test-$i --image=nginx --restart=Never \
    --overrides='{ "spec": { "schedulerName": "custom-scheduler" } }'
done

# Check which nodes they landed on
k get pods -o wide | grep test-
# Most should be on the "premium" node due to higher score
```

- Verify with events:
```bash
k describe pod test-1 | grep -A 5 "Events:"
# Should show "Scheduled" event from "custom-scheduler"
```

- Test with the default scheduler for comparison:
```bash
for i in $(seq 1 5); do
  k run default-$i --image=nginx --restart=Never
done
k get pods -o wide | grep default-
# Should be distributed more evenly (default scheduler does not know about tiers)
```

- Cleanup:
```bash
kind delete cluster --name scheduler-lab
```

Success Criteria:
- Three worker nodes labeled with different tiers
- Custom scheduler deploys and reports healthy
- Pods with `schedulerName: custom-scheduler` are scheduled
- Premium-tier node receives more pods than burstable
- Events show the custom scheduler name
- Default scheduler pods distribute differently
- Scheduler logs show Score plugin execution
Next Module
Module 1.8: API Aggregation & Extension API Servers - Build custom API servers that extend the Kubernetes API beyond what CRDs can offer.