
Module 1.7: Customizing the Scheduler


Complexity: [COMPLEX] - Extending Kubernetes scheduling decisions

Time to Complete: 4 hours

Prerequisites: Module 1.1 (API Deep Dive), understanding of Pod scheduling basics


After completing this module, you will be able to:

  1. Implement a custom Filter plugin that excludes nodes based on real-time conditions (GPU utilization, rack topology, compliance labels)
  2. Implement a custom Score plugin that ranks nodes using business-specific criteria like data locality or cost optimization
  3. Deploy a scheduler plugin alongside the default scheduler using KubeSchedulerConfiguration profiles
  4. Evaluate when to use scheduler plugins vs. scheduling constraints (affinity, taints) for a given workload placement requirement

The default Kubernetes scheduler is remarkably capable — it handles affinity, anti-affinity, resource requests, taints, tolerations, topology constraints, and more. But there are scheduling decisions it cannot make out of the box. What if you want to schedule Pods based on real-time GPU utilization instead of static resource requests? What if you need to colocate certain workloads on the same rack for network latency? What if your compliance requirements demand scheduling based on data residency labels?

The Scheduling Framework is the answer. Since Kubernetes 1.19, the scheduler is built as a plugin system with well-defined extension points. You can write plugins in Go that hook into any stage of the scheduling pipeline — filtering nodes, scoring them, binding Pods, or even preempting lower-priority workloads. You can also run multiple schedulers side by side, each with its own personality.

The Restaurant Seating Analogy

The default scheduler is like a restaurant host who seats guests based on table availability and party size. A custom Filter plugin adds rules: “this party requested a window seat” (node affinity). A custom Score plugin adds preferences: “seat regulars closer to the kitchen for faster service” (colocation scoring). A custom Bind plugin changes the reservation process: “hold this table for 5 minutes while we confirm the reservation” (custom binding). The host still does the work; your plugins influence the decisions.


By the end of this module, you will be able to:

  • Understand the complete scheduling framework and its extension points
  • Write custom Filter and Score plugins in Go
  • Configure KubeSchedulerConfiguration for custom plugins
  • Deploy a secondary scheduler alongside the default one
  • Debug scheduling decisions using events and logs

  • The default scheduler does not evaluate every node in a large cluster. It stops filtering once it has found enough feasible nodes: a percentage of the cluster controlled by percentageOfNodesToScore (adaptive by default, shrinking as the cluster grows, with a floor of 100 feasible nodes). For a 5,000-node cluster it samples and picks the best from the sample, which is why scheduling stays fast even at massive scale.

  • Kubernetes supports running multiple schedulers simultaneously: You can have the default scheduler for most workloads and a custom scheduler for GPU workloads, each with different plugins enabled. Pods declare which scheduler to use via spec.schedulerName.

  • The Scheduling Framework is the recommended extension mechanism since Kubernetes 1.19. Scheduler extenders (webhook-based) and standalone custom schedulers are still supported but the Framework offers better performance and deeper integration. Scheduler policy files are deprecated.


flowchart TD
    Start([Pod enters scheduling queue]) --> PreEnqueue
    subgraph Queueing Phase
        PreEnqueue[1. PreEnqueue<br/>Reject pods before queuing] --> Sort[2. Sort<br/>Order pods in the queue]
    end
    Sort --> PreFilter
    subgraph Scheduling Cycle
        PreFilter[3. PreFilter<br/>Compute shared state] --> Filter[4. Filter<br/>Eliminate infeasible nodes]
        Filter -- If no nodes fit --> PostFilter[5. PostFilter<br/>Handle preemption]
        Filter -- If nodes pass --> PreScore[6. PreScore<br/>Compute shared score state]
        PostFilter -. Retry scheduling .-> PreFilter
        PreScore --> Score[7. Score<br/>Rank feasible nodes 0-100]
        Score --> Normalize[8. NormalizeScore<br/>Normalize scores]
        Normalize --> Reserve[9. Reserve<br/>Optimistically assume placement]
        Reserve --> Permit[10. Permit<br/>Hold, allow, or deny binding]
    end
    Permit --> PreBind
    subgraph Binding Cycle
        PreBind[11. PreBind<br/>Pre-binding operations] --> Bind[12. Bind<br/>Actually bind Pod to Node]
        Bind --> PostBind[13. PostBind<br/>Informational, after binding]
    end
| Extension Point | When It Runs | What It Does | Return Type |
|---|---|---|---|
| PreEnqueue | Before queuing | Gate pods from entering queue | Allow/Reject |
| Sort | Queue ordering | Prioritize pods in queue | Less function |
| PreFilter | Once per cycle | Compute shared filter state | Status |
| Filter | Per node | Eliminate infeasible nodes | Status (pass/fail) |
| PostFilter | After no node fits | Try preemption | Status + nominated node |
| PreScore | Once per cycle | Compute shared score state | Status |
| Score | Per node | Rank nodes 0-100 | Score + Status |
| NormalizeScore | After all scores | Normalize to [0,100] | Status |
| Reserve | After node selected | Optimistic reservation | Status |
| Permit | Before binding | Approve/deny/wait | Status + wait time |
| PreBind | Before actual bind | Pre-binding actions | Status |
| Bind | Binding | Bind pod to node | Status |
| PostBind | After binding | Cleanup, notifications | void |

The default scheduler already uses these plugins:

| Plugin | Extension Points | What It Does |
|---|---|---|
| NodeResourcesFit | PreFilter, Filter | Check CPU/memory availability |
| NodePorts | PreFilter, Filter | Check port availability |
| NodeAffinity | Filter, Score | Node affinity/anti-affinity rules |
| PodTopologySpread | PreFilter, Filter, PreScore, Score | Topology spread constraints |
| TaintToleration | Filter, PreScore, Score | Taint/toleration matching |
| InterPodAffinity | PreFilter, Filter, PreScore, Score | Pod affinity/anti-affinity |
| VolumeBinding | PreFilter, Filter, Reserve, PreBind | PV/PVC binding |
| DefaultPreemption | PostFilter | Preempt lower-priority pods |
| ImageLocality | Score | Prefer nodes with cached images |
| BalancedAllocation | Score | Balance resource usage across nodes |

scheduler-plugins/
├── go.mod
├── go.sum
├── cmd/
│   └── scheduler/
│       └── main.go                  # Entry point
├── pkg/
│   └── plugins/
│       └── nodepreference/
│           ├── nodepreference.go    # Plugin implementation
│           └── nodepreference_test.go
└── manifests/
    ├── scheduler-config.yaml        # KubeSchedulerConfiguration
    └── deployment.yaml              # Secondary scheduler deployment

This plugin scores nodes based on a custom label. Nodes labeled scheduling.kubedojo.io/tier: premium get a higher score than standard or unlabeled nodes:

pkg/plugins/nodepreference/nodepreference.go
package nodepreference

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Name is the name of the plugin.
	Name = "NodePreference"
	// LabelKey is the node label key used for scoring.
	LabelKey = "scheduling.kubedojo.io/tier"
)

// NodePreference scores nodes based on a tier label.
type NodePreference struct {
	handle framework.Handle
	args   NodePreferenceArgs
}

// NodePreferenceArgs are the arguments for the plugin.
type NodePreferenceArgs struct {
	metav1.TypeMeta `json:",inline"`

	// TierScores maps tier label values to scores (0-100).
	TierScores map[string]int64 `json:"tierScores"`
	// DefaultScore is the score for nodes without the tier label.
	DefaultScore int64 `json:"defaultScore"`
}

var _ framework.ScorePlugin = &NodePreference{}
var _ framework.EnqueueExtensions = &NodePreference{}

// Name returns the name of the plugin.
func (pl *NodePreference) Name() string {
	return Name
}

// Score scores a node based on its tier label.
func (pl *NodePreference) Score(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeName string,
) (int64, *framework.Status) {
	// Get the node info from the scheduler's snapshot (in-memory cache)
	nodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.AsStatus(fmt.Errorf("getting node %q: %w", nodeName, err))
	}
	node := nodeInfo.Node()

	// Check for the tier label
	tierValue, exists := node.Labels[LabelKey]
	if !exists {
		return pl.args.DefaultScore, nil
	}

	// Look up the score for this tier
	score, found := pl.args.TierScores[tierValue]
	if !found {
		return pl.args.DefaultScore, nil
	}
	return score, nil
}

// ScoreExtensions returns the score extension functions.
func (pl *NodePreference) ScoreExtensions() framework.ScoreExtensions {
	return pl
}

// NormalizeScore normalizes the scores to [0, MaxNodeScore].
func (pl *NodePreference) NormalizeScore(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	scores framework.NodeScoreList,
) *framework.Status {
	// Find the max score
	var maxScore int64
	for i := range scores {
		if scores[i].Score > maxScore {
			maxScore = scores[i].Score
		}
	}
	if maxScore == 0 {
		return nil
	}
	// Scale so the best node scores exactly MaxNodeScore (100)
	for i := range scores {
		scores[i].Score = (scores[i].Score * framework.MaxNodeScore) / maxScore
	}
	return nil
}

// EventsToRegister returns the events that trigger rescheduling.
func (pl *NodePreference) EventsToRegister() []framework.ClusterEventWithHint {
	return []framework.ClusterEventWithHint{
		{ClusterEvent: framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.Update}},
	}
}

// New creates a new NodePreference plugin.
func New(ctx context.Context, obj runtime.Object, handle framework.Handle) (framework.Plugin, error) {
	args, ok := obj.(*NodePreferenceArgs)
	if !ok {
		return nil, fmt.Errorf("want args to be of type NodePreferenceArgs, got %T", obj)
	}
	return &NodePreference{
		handle: handle,
		args:   *args,
	}, nil
}

Stop and think: In a cluster with thousands of nodes and hundreds of pods being scheduled per second, reading labels directly from the API server inside the Score function would cause massive latency. How does the Scheduling Framework prevent this API server bottleneck when you call pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)?

A filter plugin eliminates nodes that do not meet certain criteria:

pkg/plugins/gpufilter/gpufilter.go
package gpufilter

import (
	"context"
	"fmt"
	"strconv"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	Name             = "GPUFilter"
	GPUCountLabel    = "gpu.kubedojo.io/count"
	GPUTypeLabel     = "gpu.kubedojo.io/type"
	PodGPUAnnotation = "scheduling.kubedojo.io/gpu-type"
)

type GPUFilter struct {
	handle framework.Handle
}

var _ framework.FilterPlugin = &GPUFilter{}
var _ framework.PreFilterPlugin = &GPUFilter{}

func (pl *GPUFilter) Name() string {
	return Name
}

// preFilterState carries the computed GPU requirement from PreFilter to Filter.
type preFilterState struct {
	requiredGPUType string
	needsGPU        bool
}

func (s *preFilterState) Clone() framework.StateData {
	return &preFilterState{
		requiredGPUType: s.requiredGPUType,
		needsGPU:        s.needsGPU,
	}
}

const preFilterStateKey = "PreFilter" + Name

// PreFilter checks if the pod needs GPU scheduling at all.
func (pl *GPUFilter) PreFilter(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
) (*framework.PreFilterResult, *framework.Status) {
	gpuType := pod.Annotations[PodGPUAnnotation]
	pfs := &preFilterState{
		requiredGPUType: gpuType,
		needsGPU:        gpuType != "",
	}
	state.Write(preFilterStateKey, pfs)
	if !pfs.needsGPU {
		// Skip the filter entirely — this pod doesn't need GPU
		return nil, framework.NewStatus(framework.Skip)
	}
	return nil, nil
}

func (pl *GPUFilter) PreFilterExtensions() framework.PreFilterExtensions {
	return nil
}

// Filter checks if a node has the required GPU type and available GPUs.
func (pl *GPUFilter) Filter(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeInfo *framework.NodeInfo,
) *framework.Status {
	// Read pre-filter state
	data, err := state.Read(preFilterStateKey)
	if err != nil {
		return framework.AsStatus(fmt.Errorf("reading pre-filter state: %w", err))
	}
	pfs, ok := data.(*preFilterState)
	if !ok {
		return framework.AsStatus(fmt.Errorf("unexpected pre-filter state type %T", data))
	}
	if !pfs.needsGPU {
		return nil // Should not reach here due to Skip, but be safe
	}
	node := nodeInfo.Node()

	// Check GPU type
	nodeGPUType, exists := node.Labels[GPUTypeLabel]
	if !exists {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("node %s has no GPU type label", node.Name))
	}
	if nodeGPUType != pfs.requiredGPUType {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("node has GPU type %q, pod requires %q",
				nodeGPUType, pfs.requiredGPUType))
	}

	// Check GPU count
	gpuCountStr, exists := node.Labels[GPUCountLabel]
	if !exists {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("node %s has no GPU count label", node.Name))
	}
	gpuCount, err := strconv.Atoi(gpuCountStr)
	if err != nil || gpuCount <= 0 {
		return framework.NewStatus(framework.Unschedulable,
			fmt.Sprintf("node %s has invalid GPU count: %s", node.Name, gpuCountStr))
	}
	return nil
}

func New(ctx context.Context, obj runtime.Object, handle framework.Handle) (framework.Plugin, error) {
	return &GPUFilter{handle: handle}, nil
}

cmd/scheduler/main.go
package main

import (
	"os"

	"k8s.io/component-base/cli"
	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"github.com/kubedojo/scheduler-plugins/pkg/plugins/gpufilter"
	"github.com/kubedojo/scheduler-plugins/pkg/plugins/nodepreference"
)

func main() {
	// Register custom plugins with the stock scheduler command
	command := app.NewSchedulerCommand(
		app.WithPlugin(nodepreference.Name, nodepreference.New),
		app.WithPlugin(gpufilter.Name, gpufilter.New),
	)
	code := cli.Run(command)
	os.Exit(code)
}
Terminal window
# Initialize Go module
cd ~/extending-k8s/scheduler-plugins
go mod init github.com/kubedojo/scheduler-plugins
# Important: pin to the same Kubernetes version as your cluster.
# Note: k8s.io/kubernetes pins its staging modules (k8s.io/api, k8s.io/apimachinery, ...)
# to v0.0.0, so your go.mod will also need replace directives pointing them at the
# matching release.
K8S_VERSION=v1.32.0
go get k8s.io/kubernetes@$K8S_VERSION
go get k8s.io/component-base@$K8S_VERSION
go mod tidy
go build -o custom-scheduler ./cmd/scheduler/

Dockerfile
FROM golang:1.23 AS builder
WORKDIR /workspace
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o custom-scheduler ./cmd/scheduler/
FROM gcr.io/distroless/static:nonroot
COPY --from=builder /workspace/custom-scheduler /custom-scheduler
USER 65532:65532
ENTRYPOINT ["/custom-scheduler"]
Terminal window
docker build -t custom-scheduler:v0.1.0 .
kind load docker-image custom-scheduler:v0.1.0 --name scheduler-lab

manifests/scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true
  resourceNamespace: kube-system
  resourceName: custom-scheduler
profiles:
  - schedulerName: custom-scheduler # Pods reference this name
    plugins:
      # Enable our custom plugins
      filter:
        enabled:
          - name: GPUFilter
      score:
        enabled:
          - name: NodePreference
            weight: 25 # Weight relative to other score plugins
      # Disable built-in plugins we're replacing
      # (usually you keep them all and just add yours)
    pluginConfig:
      - name: NodePreference
        args:
          tierScores:
            premium: 100
            standard: 50
            burstable: 20
          defaultScore: 10
manifests/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-scheduler
  namespace: kube-system
  labels:
    component: custom-scheduler
spec:
  replicas: 2 # HA with leader election
  selector:
    matchLabels:
      component: custom-scheduler
  template:
    metadata:
      labels:
        component: custom-scheduler
    spec:
      serviceAccountName: custom-scheduler
      containers:
        - name: scheduler
          image: custom-scheduler:v0.1.0
          command:
            - /custom-scheduler
            - --config=/etc/scheduler/config.yaml
            - --v=2
          volumeMounts:
            - name: config
              mountPath: /etc/scheduler
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 10259
              scheme: HTTPS
            initialDelaySeconds: 15
          readinessProbe:
            httpGet:
              path: /healthz
              port: 10259
              scheme: HTTPS
      volumes:
        - name: config
          configMap:
            name: custom-scheduler-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-scheduler-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: true
      resourceNamespace: kube-system
      resourceName: custom-scheduler
    profiles:
      - schedulerName: custom-scheduler
        plugins:
          score:
            enabled:
              - name: NodePreference
                weight: 25
        pluginConfig:
          - name: NodePreference
            args:
              tierScores:
                premium: 100
                standard: 50
                burstable: 20
              defaultScore: 10
manifests/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: custom-scheduler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: custom-scheduler
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "namespaces", "configmaps", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/binding", "pods/status"]
    verbs: ["create", "update", "patch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch", "update"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "statefulsets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: custom-scheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: custom-scheduler
subjects:
  - kind: ServiceAccount
    name: custom-scheduler
    namespace: kube-system

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  annotations:
    scheduling.kubedojo.io/gpu-type: "a100"
spec:
  schedulerName: custom-scheduler # Use our custom scheduler
  containers:
    - name: training
      image: nvidia/cuda:12.0-base
      resources:
        limits:
          nvidia.com/gpu: 1

A single scheduler binary can serve multiple profiles:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: gpu-scheduler
    plugins:
      filter:
        enabled:
          - name: GPUFilter
      score:
        enabled:
          - name: NodePreference
            weight: 50
  - schedulerName: low-latency-scheduler
    plugins:
      score:
        enabled:
          - name: NodePreference
            weight: 80
        disabled:
          - name: ImageLocality # Disable image locality for latency workloads
    pluginConfig:
      - name: NodePreference
        args:
          tierScores:
            edge: 100
            regional: 60
          defaultScore: 0
Terminal window
# Check scheduler events for a pod
k describe pod gpu-workload | grep -A 15 "Events:"
# Look for scheduling failures
k get events --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
# View scheduler logs
k logs -n kube-system -l component=custom-scheduler -f --tail=100
# Check if the custom scheduler is registered
k get pods -n kube-system -l component=custom-scheduler
# Verify a pod is using the custom scheduler
k get pod gpu-workload -o jsonpath='{.spec.schedulerName}'

6.1 Scheduler Profiles vs Multiple Schedulers

| Approach | Pros | Cons |
|---|---|---|
| Multiple profiles (one binary) | Shared cache, single deployment | Same plugin set available to all profiles |
| Multiple schedulers (separate binaries) | Complete isolation, different plugins | Higher resource usage, separate caches |

When multiple Score plugins run, each plugin's normalized score (0-100) is multiplied by that plugin's weight, and the weighted scores are summed:

final_score(node) = SUM(normalized_plugin_score(node) * plugin_weight)

(Dividing by the total weight, as some write-ups do, would not change the relative ranking; the scheduler simply picks the node with the highest sum.)
plugins:
  score:
    enabled:
      - name: NodeResourcesFit
        weight: 1 # Default
      - name: NodePreference
        weight: 25 # 25x more important than default
      - name: InterPodAffinity
        weight: 2 # 2x default

When no node can fit a pod, PostFilter plugins try preemption:

High-priority Pod cannot be scheduled
PostFilter: DefaultPreemption
├── Find nodes where evicting lower-priority Pods would make room
├── Select victim Pods (prefer lowest priority, fewest evictions)
├── Set pod.Status.NominatedNodeName
└── Evict victim Pods → retry scheduling

Pause and predict: If a custom PostFilter plugin successfully preempts lower-priority pods to make room for a critical workload, does the PostFilter plugin directly bind the pending pod to the newly freed node? What must happen next?

Custom PostFilter plugins can implement alternative preemption strategies.


| Mistake | Problem | Solution |
|---|---|---|
| Not pinning Kubernetes version | Build breaks with dependency conflicts | Pin go.mod to exact cluster K8s version |
| Forgetting RBAC for the scheduler | Scheduler cannot read nodes/pods | Apply comprehensive ClusterRole |
| Score plugin returning > 100 | Panic or wrong normalization | Always return 0-100; use NormalizeScore |
| Filter plugin blocking all nodes | Pod stuck in Pending forever | Add fallback or make filter optional |
| No leader election on multi-replica | Duplicate scheduling decisions | Enable leader election in config |
| Wrong schedulerName in pod spec | Pod uses default scheduler, not custom | Verify the name matches the profile name exactly |
| Slow Score plugin | Scheduling latency spikes | Keep scoring O(1) per node; precompute in PreScore |
| Not handling missing labels gracefully | Panics or nil pointer errors | Always check label existence before using |
| Forgetting to register plugin in main.go | Plugin silently not loaded | Use app.WithPlugin() in the scheduler command |

  1. You are developing a plugin to ensure machine learning workloads only land on nodes with specific compliance certifications. Another plugin is needed to distribute these workloads across multiple availability zones to minimize blast radius. Which plugin types should you use for each requirement, and why?

    Answer: You should use a Filter plugin for the compliance certifications and a Score plugin for the availability zone distribution. The compliance requirement is a hard constraint; if a node lacks the certification, it must be completely eliminated from consideration, which is exactly what a Filter plugin does by returning a pass/fail status. The availability zone distribution is a soft preference; all certified nodes are technically valid, but you want to rank nodes in underutilized zones higher using a Score plugin. The scheduler will then evaluate the surviving nodes and place the Pod on the one with the highest normalized score.
  2. Your cluster administrator has deployed a new gpu-scheduler alongside the default-scheduler. You submit a Deployment with a pod template that requires a GPU, but you forget to add any scheduler-specific fields to the manifest. The gpu-scheduler is perfectly configured to handle this workload. What will happen to your Pods?

    Answer: Your Pods will be processed by the `default-scheduler` instead of the `gpu-scheduler` and may remain in a Pending state if the default scheduler lacks the necessary logic to place them. By default, if the `spec.schedulerName` field is omitted from a Pod's specification, the Kubernetes API server implicitly assigns it to the `default-scheduler`. The secondary `gpu-scheduler` operates completely independently and only watches for Pods explicitly requesting its exact profile name. To fix this, you must update the pod template to include `schedulerName: gpu-scheduler` so the default scheduler ignores it and the custom scheduler picks it up.
  3. You are writing a Filter plugin that needs to perform a complex, computationally expensive calculation based on a Pod’s annotations to determine compatibility. If you perform this calculation inside the Filter extension point on a 5,000-node cluster, you notice significant scheduling delays. How can you redesign your plugin to resolve this performance bottleneck?

    Answer: You should move the computationally expensive calculation into the `PreFilter` extension point and store the result in the `CycleState`. The `Filter` extension point is invoked individually for every single feasible node in the cluster, meaning your calculation was being executed up to 5,000 times per Pod. The `PreFilter` extension point, however, is invoked exactly once per scheduling cycle before any node filtering begins. By computing the value once in `PreFilter` and reading that shared state inside the `Filter` function, you reduce the time complexity significantly and eliminate the scheduling delays.
  4. Your team creates a custom Score plugin that assigns a score of 500 to nodes with high network bandwidth and 10 to nodes with low bandwidth. However, during testing, you notice that the default scheduler plugins (like NodeResourcesFit) are completely ignoring your scores, and workloads are not being placed optimally. What architectural requirement of the Scheduling Framework did your team violate?

    Answer: Your team violated the requirement that all final scores must be normalized to a standard range, specifically between 0 and 100 (`framework.MaxNodeScore`). When a plugin returns raw scores outside this boundary, it must implement the `NormalizeScore` extension point to mathematically map its internal scoring system down to the 0-100 scale. Because your plugin returned a raw score of 500 without normalizing it, the framework either rejected the score or improperly weighted it against built-in plugins that correctly operate within the 0-100 range. Implementing the `NormalizeScore` function to scale 500 down to 100 will fix the aggregation issue.
  5. You deploy a mission-critical Pod with schedulerName: fast-scheduler, but the fast-scheduler deployment has crashed and currently has zero running replicas. The default-scheduler is perfectly healthy and capable of placing the Pod. How will the cluster handle this failure scenario to ensure the Pod gets scheduled?

    Answer: The cluster will not schedule the Pod at all, and it will remain in a Pending state indefinitely until the `fast-scheduler` is restored. Kubernetes does not have a fallback mechanism for scheduler assignment; the `spec.schedulerName` is a strict, exclusive contract. The `default-scheduler` explicitly filters out any Pods that do not match its own name, meaning it will completely ignore your mission-critical Pod. This architectural design prevents race conditions and conflicts that would occur if multiple schedulers attempted to bind the same Pod simultaneously.
  6. You have configured a custom KubeSchedulerConfiguration profile with your NodePreference plugin and the built-in InterPodAffinity plugin. Your plugin correctly scores premium nodes at 100, but Pods are still consistently landing on standard nodes (score 50) because they already contain other Pods from the same application. How can you configure the scheduler to prioritize your custom tier preference over the built-in pod affinity?

    Answer: You need to adjust the `weight` field for your `NodePreference` plugin within the `KubeSchedulerConfiguration` profile to be significantly higher than the weight of the `InterPodAffinity` plugin. The final node score is a weighted sum of all active Score plugins, meaning a plugin with a higher weight has a proportionally larger mathematical impact on the final decision. By default, if your plugin has a weight of 1 and `InterPodAffinity` has a weight of 5, the built-in affinity will easily override your tier preference. Increasing your plugin's weight to 10 or 20 will ensure the framework mathematically favors premium nodes even when pod affinity suggests otherwise.

Task: Build a custom Score plugin that prefers nodes with a specific tier label, configure it via KubeSchedulerConfiguration, deploy it as a secondary scheduler, and verify scheduling decisions.

Setup:

Terminal window
kind create cluster --name scheduler-lab --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF

Steps:

  1. Label the nodes with tiers:
Terminal window
# Get worker node names
NODES=$(k get nodes --no-headers -o custom-columns=':metadata.name' | grep -v control-plane)
# Label them
NODE1=$(echo "$NODES" | sed -n '1p')
NODE2=$(echo "$NODES" | sed -n '2p')
NODE3=$(echo "$NODES" | sed -n '3p')
k label node "$NODE1" scheduling.kubedojo.io/tier=premium
k label node "$NODE2" scheduling.kubedojo.io/tier=standard
k label node "$NODE3" scheduling.kubedojo.io/tier=burstable
# Verify labels
k get nodes --show-labels | grep kubedojo
  2. Create the Go project from the code in Parts 2 and 3

  3. Build and load the scheduler image:

Terminal window
docker build -t custom-scheduler:v0.1.0 .
kind load docker-image custom-scheduler:v0.1.0 --name scheduler-lab
  4. Deploy RBAC, ConfigMap, and Deployment from Part 4

  5. Verify the custom scheduler is running:

Terminal window
k get pods -n kube-system -l component=custom-scheduler
k logs -n kube-system -l component=custom-scheduler --tail=20
  6. Create test Pods using the custom scheduler:
Terminal window
# Create 5 pods with the custom scheduler
for i in $(seq 1 5); do
  k run test-$i --image=nginx --restart=Never \
    --overrides='{"spec": {"schedulerName": "custom-scheduler"}}'
done
# Check which nodes they landed on
k get pods -o wide | grep test-
# Most should be on the "premium" node due to higher score
  7. Verify with events:
Terminal window
k describe pod test-1 | grep -A 5 "Events:"
# Should show "Scheduled" event from "custom-scheduler"
  8. Test with the default scheduler for comparison:
Terminal window
for i in $(seq 1 5); do
  k run default-$i --image=nginx --restart=Never
done
k get pods -o wide | grep default-
# Should be distributed more evenly (default scheduler does not know about tiers)
  9. Cleanup:
Terminal window
kind delete cluster --name scheduler-lab

Success Criteria:

  • Three worker nodes labeled with different tiers
  • Custom scheduler deploys and reports healthy
  • Pods with schedulerName: custom-scheduler are scheduled
  • Premium-tier node receives more pods than burstable
  • Events show the custom scheduler name
  • Default scheduler pods distribute differently
  • Scheduler logs show Score plugin execution

Module 1.8: API Aggregation & Extension API Servers - Build custom API servers that extend the Kubernetes API beyond what CRDs can offer.