Module 9.5: Ray Serve - Distributed Model Serving
Complexity: [COMPLEX]
Time to Complete: 90 minutes
Prerequisites: Module 9.4 (vLLM), Basic Python, Kubernetes workload fundamentals, familiarity with CPU and GPU resource requests
Learning Outcomes
After completing this module, you will be able to:
- Design a Ray Serve deployment that separates cluster lifecycle, model lifecycle, and request routing concerns on Kubernetes.
- Configure KubeRay RayCluster and RayService resources with CPU, GPU, autoscaling, and health-check settings that match inference workload behavior.
- Implement a Ray Serve application that composes multiple model stages into a production-style inference pipeline.
- Debug common Ray Serve failures by tracing symptoms through Kubernetes resources, Ray actors, Serve replicas, and application logs.
- Evaluate when Ray Serve is a better fit than KServe, Triton, Seldon Core, or a plain HTTP service for distributed AI workloads.
Why This Module Matters
A platform team is asked to productionize a document-intelligence system that worked during a demo. The demo ran on one large GPU machine with a notebook, a local vector store, and a few helper functions. Production is different: OCR needs CPU replicas, embedding needs smaller GPU slices, generation needs larger GPU allocations, and the business wants every customer request to pass through the same API without learning which model lives where.
The first attempt uses separate Kubernetes Deployments for every model stage. It looks familiar to the platform team, but the system becomes hard to operate. The OCR pods scale on CPU load, the embedding pods sit idle between bursts, the generation pods queue requests during traffic spikes, and every retry rule is duplicated in application code. The team can scale each service, yet the pipeline itself has no shared view of work in flight.
Ray Serve changes the shape of the problem. Instead of treating every model as a separate web service, Ray Serve treats each model stage as a distributed deployment inside a Ray application. KubeRay then gives Kubernetes a native way to manage the Ray runtime, the worker pods, and the Serve application. This is not a replacement for Kubernetes; it is a compute layer that lets Kubernetes host a distributed Python serving system.
Ray Serve matters because many AI platforms are no longer serving one model behind one endpoint. They are serving chains of models, preprocessors, rankers, retrievers, safety filters, and generators with different resource profiles. A senior platform engineer needs to decide where Kubernetes should own scheduling, where Ray should own distributed execution, and where application code should own model behavior. This module teaches that boundary.
Ray Serve in the Serving Stack
Ray Serve is an HTTP serving framework built on top of Ray. Ray provides distributed actors, tasks, scheduling, and resource accounting. Serve adds long-running deployments, request routing, autoscaling, health checks, and composition for online inference. KubeRay adds Kubernetes custom resources so that platform teams can manage Ray clusters and Serve applications through normal cluster workflows.
A beginner mistake is to think of Ray Serve as just another Python web framework. That mental model misses the most important part. A Flask or FastAPI service usually runs inside one process per pod, while Ray Serve can route one request across multiple distributed actors that may live on different pods and nodes. The HTTP endpoint is only the front door; the useful work happens inside a distributed graph of deployments.
A senior mistake is to assume Ray Serve replaces all other serving tools. It does not. Triton may be stronger for tightly optimized low-latency GPU inference of supported model formats. KServe may be a better fit when an organization standardizes on Knative-based model serving. Ray Serve is strongest when Python-native composition, heterogeneous resource allocation, and distributed orchestration are the hard parts.
    ┌────────────────────────────────────────────────────────────────────────────┐
    │                          Ray Serve on Kubernetes                           │
    │                                                                            │
    │ Client Request                                                             │
    │        │                                                                   │
    │        ▼                                                                   │
    │ ┌────────────────────┐        ┌──────────────────────────────────────────┐ │
    │ │ Kubernetes Service │───────▶│ Ray Serve HTTP Proxy                     │ │
    │ └────────────────────┘        └──────────────────────────────────────────┘ │
    │                                                    │                       │
    │                                                    ▼                       │
    │ ┌────────────────────────────────────────────────────────────────────────┐ │
    │ │ Ray Serve Application                                                  │ │
    │ │                                                                        │ │
    │ │    ┌──────────────┐     ┌──────────────┐     ┌────────────────────┐    │ │
    │ │    │ Preprocessor │────▶│ Retriever    │────▶│ Generator          │    │ │
    │ │    │ CPU actors   │     │ CPU/GPU      │     │ GPU actors         │    │ │
    │ │    └──────────────┘     └──────────────┘     └────────────────────┘    │ │
    │ │                                                                        │ │
    │ └────────────────────────────────────────────────────────────────────────┘ │
    │                                      │                                     │
    │                                      ▼                                     │
    │ ┌────────────────────────────────────────────────────────────────────────┐ │
    │ │ Ray Runtime: scheduling, object refs, actor placement, resources       │ │
    │ └────────────────────────────────────────────────────────────────────────┘ │
    │                                      │                                     │
    │                                      ▼                                     │
    │ ┌────────────────────────────────────────────────────────────────────────┐ │
    │ │ Kubernetes: pods, nodes, GPUs, services, namespaces, operator control  │ │
    │ └────────────────────────────────────────────────────────────────────────┘ │
    └────────────────────────────────────────────────────────────────────────────┘

The diagram is intentionally layered. Kubernetes still owns pods, nodes, services, and GPU devices. Ray owns distributed execution inside the cluster. Ray Serve owns request routing and deployment lifecycle inside Ray. Your application code owns model initialization, input validation, model calls, and response shape. Keeping those boundaries clear prevents many production incidents.
Pause and predict: If a request passes through three Ray Serve deployments and the middle deployment is slow, which part should scale first: the Kubernetes Service, the Ray head pod, the slow Serve deployment, or every worker pod? Write down your answer before continuing. The best first move is usually to scale the slow Serve deployment, because the queue forms at the stage that cannot consume work quickly enough. Scaling the Service does not add model capacity, and scaling every pod may waste resources.
| Layer | Owns | Common configuration | Failure symptom |
|---|---|---|---|
| Kubernetes | Pods, services, GPU devices, namespaces | resources, nodeSelector, tolerations, Services | Pod pending, image pull errors, GPU unavailable |
| KubeRay | Ray cluster and Ray service lifecycle | RayCluster, RayService, worker groups | Ray head unavailable, workers not joining |
| Ray runtime | Actors, tasks, object refs, placement | num_cpus, num_gpus, placement behavior | Actor pending, resource deadlock, worker crash |
| Ray Serve | HTTP routing, deployments, replicas, autoscaling | num_replicas, autoscaling_config, route prefix | Request queueing, replica unhealthy, route missing |
| Application code | Model loading, validation, inference logic | Python classes, request handlers, model clients | Exceptions, bad responses, slow model calls |
The practical lesson is that platform debugging follows the same layers. If the pod cannot start, look at Kubernetes first. If Ray workers do not join, inspect KubeRay and Ray startup logs. If a deployment has no healthy replicas, inspect Ray Serve state. If the replica is healthy but responses are wrong, inspect application code. Randomly changing autoscaling settings before locating the failing layer usually creates more noise.
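The outside-in rule can be written down as a small lookup. The sketch below is illustrative only: the keyword lists are copied from the failure-symptom column of the table above, and `first_layer_to_inspect` is a hypothetical helper, not part of Ray or KubeRay.

```python
# Hypothetical triage helper: ordered from the outermost layer inward,
# using the failure symptoms listed in the layer table.
TRIAGE_ORDER = [
    ("Kubernetes", ("pod pending", "image pull", "gpu unavailable")),
    ("KubeRay", ("head unavailable", "workers not joining")),
    ("Ray runtime", ("actor pending", "resource deadlock", "worker crash")),
    ("Ray Serve", ("request queueing", "replica unhealthy", "route missing")),
    ("Application code", ("exception", "bad response", "slow model call")),
]


def first_layer_to_inspect(symptom: str) -> str:
    """Return the layer whose known symptoms match the description."""
    text = symptom.lower()
    for layer, keywords in TRIAGE_ORDER:
        if any(keyword in text for keyword in keywords):
            return layer
    # Unknown symptom: start at the outermost layer and work inward.
    return "Kubernetes"


if __name__ == "__main__":
    print(first_layer_to_inspect("Replica unhealthy after rollout"))
```

A real runbook would key on structured events rather than free text, but the ordering is the point: check the outer layer before tuning the inner one.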
Ray Architecture and KubeRay Resources
A Ray cluster has one head node and zero or more worker nodes. The head node runs coordination components, including the Global Control Store, the dashboard, and often the Serve controller. Worker nodes run Ray workers that execute tasks and actors. In Kubernetes, these nodes are represented by pods managed by KubeRay custom resources.
The head node should be treated as control-plane-like infrastructure for the Ray cluster. It is not the place to pack all expensive inference work unless the deployment is tiny. For production serving, the head pod normally exposes dashboard, client, and Serve ports while worker pods provide the CPU and GPU capacity for model replicas. This separation makes failures easier to reason about and keeps model resource pressure away from coordination.
    ┌────────────────────────────────────────────────────────────────────────────┐
    │                                Ray Cluster                                 │
    │                                                                            │
    │ ┌────────────────────────────────────────────────────────────────────────┐ │
    │ │ Head Pod                                                               │ │
    │ │                                                                        │ │
    │ │   ┌────────────────────┐  ┌────────────────────┐  ┌────────────────┐   │ │
    │ │   │ Global Control     │  │ Ray Dashboard      │  │ Serve          │   │ │
    │ │   │ Store              │  │ port 8265          │  │ Controller     │   │ │
    │ │   └────────────────────┘  └────────────────────┘  └────────────────┘   │ │
    │ │              │                       │                     │           │ │
    │ │              └───────────────────────┼─────────────────────┘           │ │
    │ │                                      ▼                                 │ │
    │ │                              ┌────────────────┐                        │ │
    │ │                              │ Head Raylet    │                        │ │
    │ │                              └────────────────┘                        │ │
    │ └────────────────────────────────────────────────────────────────────────┘ │
    │                                                                            │
    │  ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────┐  │
    │  │ Worker Pod A         │  │ Worker Pod B         │  │ Worker Pod C     │  │
    │  │                      │  │                      │  │                  │  │
    │  │ ┌──────────────────┐ │  │ ┌──────────────────┐ │  │ ┌──────────────┐ │  │
    │  │ │ Raylet           │ │  │ │ Raylet           │ │  │ │ Raylet       │ │  │
    │  │ └──────────────────┘ │  │ └──────────────────┘ │  │ └──────────────┘ │  │
    │  │ ┌──────────────────┐ │  │ ┌──────────────────┐ │  │ ┌──────────────┐ │  │
    │  │ │ CPU actors       │ │  │ │ GPU actors       │ │  │ │ GPU actors   │ │  │
    │  │ └──────────────────┘ │  │ └──────────────────┘ │  │ └──────────────┘ │  │
    │  └──────────────────────┘  └──────────────────────┘  └──────────────────┘  │
    └────────────────────────────────────────────────────────────────────────────┘

KubeRay exposes several resources, but two matter most for serving. RayCluster creates and manages a Ray runtime. RayService manages both a Ray cluster configuration and a Ray Serve application configuration. A platform team that wants a stable online endpoint usually prefers RayService, because it gives the operator enough information to handle service health and application updates together.
A RayCluster is still useful when learners are experimenting, running jobs, or separating cluster provisioning from application deployment. In mature platforms, the decision is often organizational as much as technical. If one platform team owns shared Ray clusters and application teams deploy Serve apps separately, RayCluster may be managed by the platform. If each serving application needs its own isolated cluster lifecycle, RayService is simpler to reason about.
| Resource | Use it when | What it manages | What it does not solve by itself |
|---|---|---|---|
| RayCluster | You want an explicit Ray runtime to run jobs, notebooks, or manually deployed Serve apps | Head pod, worker groups, Ray startup parameters | Application rollout, Serve config health, route management |
| RayService | You want Kubernetes to manage a Ray Serve application and its backing Ray cluster together | Ray cluster plus Serve application config | Model correctness, data validation, business retry behavior |
| RayJob | You want to run batch or training work on Ray | Job submission and lifecycle | Long-running online inference endpoint behavior |
| Kubernetes Service | You need stable networking to the head or Serve proxy | Cluster IP and port access | Distributed scheduling, model composition, actor placement |
The distinction becomes important during upgrades. Updating a RayCluster changes the runtime capacity or Ray version. Updating a RayService can change both runtime capacity and Serve application code. A careful rollout plan separates these concerns when the risk is high. For example, upgrade Ray images in a staging cluster before changing the model pipeline import path in production.
Installing KubeRay and Deploying a Ray Cluster
The first operational step is installing the KubeRay operator. The operator watches Ray custom resources and reconciles the Kubernetes objects needed to run Ray. The commands below install the operator with Helm, verify it with kubectl, and then introduce the common alias k for later commands. In this module, k means kubectl.
    helm repo add kuberay https://ray-project.github.io/kuberay-helm/
    helm repo update

    helm install kuberay-operator kuberay/kuberay-operator \
      --namespace ray-system \
      --create-namespace

    kubectl get pods -n ray-system
    alias k=kubectl

The operator installation is cluster infrastructure, so it should usually be owned by a platform team rather than an application repository. Application teams can then own RayService manifests inside their normal deployment pipeline. This split matches the usual Kubernetes pattern: platform owns controllers and policy, product teams own workload declarations.
The following RayCluster is intentionally CPU-only. A CPU-only cluster is cheaper for learning and lets you debug the Ray and Serve control path before introducing GPU scheduling. The same pattern later extends to GPU worker groups by adding nvidia.com/gpu resource limits and a Ray actor option that requests GPUs.
    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: ray-demo-cluster
      namespace: ray-serving
    spec:
      rayVersion: "2.9.0"
      headGroupSpec:
        rayStartParams:
          dashboard-host: "0.0.0.0"
        template:
          spec:
            containers:
              - name: ray-head
                image: rayproject/ray:2.9.0-py310
                ports:
                  - containerPort: 6379
                    name: gcs
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: serve
                resources:
                  requests:
                    cpu: "1"
                    memory: "2Gi"
                  limits:
                    cpu: "2"
                    memory: "4Gi"
      workerGroupSpecs:
        - groupName: cpu-workers
          replicas: 2
          minReplicas: 1
          maxReplicas: 4
          rayStartParams: {}
          template:
            spec:
              containers:
                - name: ray-worker
                  image: rayproject/ray:2.9.0-py310
                  resources:
                    requests:
                      cpu: "1"
                      memory: "2Gi"
                    limits:
                      cpu: "2"
                      memory: "4Gi"

Apply the cluster into its own namespace so that cleanup is safe and resource ownership is clear. In a shared environment, namespace isolation also helps separate RBAC, resource quotas, and network policy. For a production platform, you would add labels, tolerations, and node selectors that match your cluster standards.

    k create namespace ray-serving
    k apply -f ray-cluster.yaml
    k get rayclusters -n ray-serving
    k get pods -n ray-serving -l ray.io/cluster=ray-demo-cluster

Active check: If the worker pods stay Pending, do not change Ray Serve settings first. Inspect Kubernetes scheduling state with k describe pod and look for CPU, memory, node selector, taint, or GPU messages. Ray cannot schedule actors onto worker pods that Kubernetes never started.
A GPU worker group adds one more scheduling layer. Kubernetes must allocate the GPU device to a pod, and Ray must allocate a GPU resource to an actor. Both sides must agree. If Kubernetes gives the pod one GPU but the Serve deployment asks Ray for two GPUs per replica, that replica will remain pending until enough Ray GPU resources exist.
    apiVersion: ray.io/v1
    kind: RayCluster
    metadata:
      name: ray-gpu-cluster
      namespace: ray-serving
    spec:
      rayVersion: "2.9.0"
      headGroupSpec:
        rayStartParams:
          dashboard-host: "0.0.0.0"
        template:
          spec:
            containers:
              - name: ray-head
                image: rayproject/ray-ml:2.9.0-py310-gpu
                resources:
                  requests:
                    cpu: "2"
                    memory: "8Gi"
                  limits:
                    cpu: "4"
                    memory: "16Gi"
      workerGroupSpecs:
        - groupName: gpu-workers
          replicas: 2
          minReplicas: 1
          maxReplicas: 4
          rayStartParams: {}
          template:
            spec:
              containers:
                - name: ray-worker
                  image: rayproject/ray-ml:2.9.0-py310-gpu
                  resources:
                    requests:
                      cpu: "4"
                      memory: "16Gi"
                      nvidia.com/gpu: "1"
                    limits:
                      cpu: "8"
                      memory: "32Gi"
                      nvidia.com/gpu: "1"

The safest way to learn GPU serving is to validate one layer at a time. First confirm the Kubernetes device plugin exposes GPUs to pods. Next confirm Ray sees GPU resources in the dashboard. Only then deploy a Serve replica that asks for num_gpus. This sequence reduces the search space when something fails.
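Whether the two schedulers agree can be estimated before deploying. The sketch below is a rough capacity check, not a Ray API: `schedulable_replicas` is a hypothetical helper that assumes every worker pod exposes the same number of GPUs to Ray and that a single replica cannot span pods.

```python
import math


def schedulable_replicas(
    worker_pods: int, gpus_per_pod: int, num_gpus_per_replica: float
) -> int:
    """Estimate how many replicas fit, given that each replica must be
    placed entirely on one worker pod (assumption of this sketch)."""
    if num_gpus_per_replica <= 0:
        raise ValueError("num_gpus_per_replica must be positive")
    # The small epsilon guards against float rounding for fractions.
    per_pod = math.floor(gpus_per_pod / num_gpus_per_replica + 1e-9)
    return worker_pods * per_pod


if __name__ == "__main__":
    # Two worker pods with one GPU each, as in the manifest above.
    print(schedulable_replicas(2, 1, 1))     # one replica per pod
    print(schedulable_replicas(2, 1, 2))     # no pod can host a 2-GPU replica
    print(schedulable_replicas(2, 1, 0.25))  # four fractional replicas per pod
```

A result of zero corresponds to the pending-replica case described earlier: Kubernetes granted GPUs, but no single pod offers enough of them for one replica.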
Building a Ray Serve Application
A Ray Serve application begins with one or more deployments. A deployment is a Python class or function decorated with @serve.deployment. Serve creates replicas of that deployment, routes requests to those replicas, and lets deployments call each other through handles. This lets you express a multi-stage inference pipeline without manually creating separate HTTP services for every stage.
The simplest deployment looks like a normal Python request handler. It receives a Starlette Request, reads JSON, and returns a JSON-serializable object. The difference is that Ray Serve owns the lifecycle of the class instances, not a standalone ASGI server that you wrote directly.
    from ray import serve
    from starlette.requests import Request


    @serve.deployment(num_replicas=2)
    class TextNormalizer:
        async def __call__(self, request: Request) -> dict:
            body = await request.json()
            text = body.get("text", "")
            # Lowercase, trim, and collapse runs of whitespace.
            normalized = " ".join(text.lower().strip().split())
            return {"normalized": normalized}


    app = TextNormalizer.bind()

Save the code as normalizer.py and run it locally in an environment with Ray installed. The command imports app from the normalizer module and starts a Serve application. In Kubernetes, RayService can load the same import path, which is why the final line matters.

    serve run normalizer:app --host 0.0.0.0 --port 8000

Test it with a normal HTTP request. The API shape is intentionally boring because production inference endpoints should be easy for callers to use. Complexity belongs behind the endpoint, where Ray Serve can distribute work across replicas and actors.

    curl http://127.0.0.1:8000 \
      -H "Content-Type: application/json" \
      -d '{"text": "  Ray   Serve   Routes   Requests  "}'

Pause and predict: What happens if you set num_replicas=2 but send only one request at a time? You should not expect a speedup for a single synchronous request. Replicas improve concurrency and availability. They help when multiple requests arrive or when one replica is busy, not when one request must pass through one serial model call.
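The prediction can be checked with a deliberately simple model. The sketch assumes all requests arrive together, every request costs the same, and each replica works on one request at a time; real replicas can overlap requests, so treat `total_time_s` as a hypothetical teaching aid rather than a latency predictor.

```python
import math


def total_time_s(num_requests: int, num_replicas: int, service_time_s: float) -> float:
    """Time to drain a burst when each replica serves one request at a time."""
    # Requests are processed in "waves" of at most num_replicas at once.
    waves = math.ceil(num_requests / num_replicas)
    return waves * service_time_s


if __name__ == "__main__":
    for requests, replicas in [(1, 1), (1, 2), (4, 1), (4, 2)]:
        print(requests, replicas, total_time_s(requests, replicas, 2.0))
```

With one request, the second replica changes nothing. With four requests, two replicas halve the time to finish the burst, which is the concurrency benefit the paragraph describes.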
A more realistic application separates stages. The next example uses a normalizer, a retriever, and a generator. The retriever and generator are intentionally lightweight so the code can run without downloading a large model. The structure is the important part: each deployment can have different resource settings and can scale independently.
    from ray import serve
    from starlette.requests import Request

    DOCUMENTS = {
        "kubernetes": "Kubernetes schedules containers, manages desired state, and exposes services.",
        "ray": "Ray provides distributed Python tasks, actors, scheduling, and object references.",
        "serve": "Ray Serve adds HTTP routing, replicas, autoscaling, and deployment composition.",
    }


    @serve.deployment(ray_actor_options={"num_cpus": 0.25})
    class Normalizer:
        def normalize(self, query: str) -> str:
            return " ".join(query.lower().strip().split())


    @serve.deployment(ray_actor_options={"num_cpus": 0.5})
    class Retriever:
        def retrieve(self, query: str) -> str:
            # Return the first document whose key appears in the query.
            for key, value in DOCUMENTS.items():
                if key in query:
                    return value
            return "No matching document was found."


    @serve.deployment(ray_actor_options={"num_cpus": 0.5})
    class Generator:
        def generate(self, query: str, context: str) -> str:
            return f"Question: {query}\nAnswer from context: {context}"


    @serve.deployment
    class RagPipeline:
        def __init__(self, normalizer, retriever, generator):
            # Each argument is a handle to another deployment.
            self.normalizer = normalizer
            self.retriever = retriever
            self.generator = generator

        async def __call__(self, request: Request) -> dict:
            body = await request.json()
            query = body.get("query", "")

            clean_query = await self.normalizer.normalize.remote(query)
            context = await self.retriever.retrieve.remote(clean_query)
            answer = await self.generator.generate.remote(clean_query, context)

            return {
                "query": query,
                "normalized_query": clean_query,
                "context": context,
                "answer": answer,
            }


    normalizer = Normalizer.bind()
    retriever = Retriever.bind()
    generator = Generator.bind()
    app = RagPipeline.bind(normalizer, retriever, generator)

The .bind() calls create the deployment graph. The .remote() calls execute methods through Serve deployment handles and return awaitable responses inside the Serve application. This is where Ray Serve differs from a chain of HTTP microservices. You are composing distributed Python actors inside one serving application, not writing service discovery and serialization code for every internal hop.
A production version would add request validation, timeouts, model warmup, observability, and error handling. It might also use a real vector database and a real language model. The learning version keeps the model logic small so that the execution model is visible. When the serving graph is clear, replacing the toy generator with vLLM or another model backend becomes a controlled change.
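Request validation is the easiest of those additions to sketch without any serving dependency. The helper below is hypothetical: `validate_query` and the `MAX_QUERY_CHARS` limit are illustrative names and values, chosen only to show where a pipeline could reject bad input before any model is called.

```python
MAX_QUERY_CHARS = 2000  # hypothetical limit; tune to the model's context budget


def validate_query(body: object) -> str:
    """Return a cleaned query string or raise ValueError with a clear message."""
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")
    query = body.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")
    if len(query) > MAX_QUERY_CHARS:
        raise ValueError(f"'query' must be at most {MAX_QUERY_CHARS} characters")
    return query.strip()


if __name__ == "__main__":
    print(validate_query({"query": "  how does ray serve scale?  "}))
```

In a handler, a caught ValueError would typically become a client-error response instead of propagating into the downstream stages.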
Worked Example: Choosing Resources for a Multi-Stage Pipeline
Consider a support-ticket assistant. It receives a customer ticket, strips noisy formatting, retrieves related runbooks, reranks the candidate documents, and asks a generator to draft a response. The stages do not have equal cost. Normalization is cheap CPU work. Retrieval is moderate CPU or network work. Reranking may use a small GPU model. Generation may need a larger GPU allocation.
The goal is not to give every stage the same replica count. The goal is to put capacity where the queue forms. If normalization takes ten milliseconds and generation takes two seconds, adding ten normalizer replicas will not fix generation latency. A good platform design starts with the bottleneck and assigns resources from there.
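The bottleneck argument is simple arithmetic, sketched below under strong assumptions: each replica handles one request at a time and service times are constant. The function names and the replica counts are illustrative; only the ten-millisecond and two-second figures come from the text.

```python
def stage_throughput(replicas: int, service_time_s: float) -> float:
    """Requests per second a stage can sustain in this simplified model."""
    return replicas / service_time_s


def find_bottleneck(stages: dict[str, tuple[int, float]]) -> str:
    """Name of the stage with the lowest sustainable throughput."""
    return min(stages, key=lambda name: stage_throughput(*stages[name]))


def pipeline_throughput(stages: dict[str, tuple[int, float]]) -> float:
    """A serial pipeline cannot run faster than its slowest stage."""
    return min(stage_throughput(*spec) for spec in stages.values())


if __name__ == "__main__":
    # (replicas, seconds per request)
    stages = {"normalize": (2, 0.010), "generate": (1, 2.0)}
    print(find_bottleneck(stages), pipeline_throughput(stages))
    # Adding ten normalizer replicas leaves the pipeline limit unchanged.
    stages["normalize"] = (12, 0.010)
    print(find_bottleneck(stages), pipeline_throughput(stages))
```

Only a change to the generator stage, such as more replicas or cheaper requests, moves the pipeline limit in this model.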
    ┌────────────────────────────────────────────────────────────────────────────┐
    │                          Support Ticket Assistant                          │
    │                                                                            │
    │  Incoming Ticket                                                           │
    │          │                                                                 │
    │          ▼                                                                 │
    │  ┌──────────────────┐   cheap CPU    ┌──────────────────┐   network/CPU    │
    │  │ Normalize Text   │───────────────▶│ Retrieve Docs    │────────────────┐ │
    │  │ 2 replicas       │                │ 4 replicas       │                │ │
    │  └──────────────────┘                └──────────────────┘                │ │
    │                                                                          ▼ │
    │  ┌──────────────────┐   small GPU    ┌──────────────────┐   larger GPU     │
    │  │ Rerank Docs      │───────────────▶│ Generate Answer  │────────────────┘ │
    │  │ 2 replicas       │                │ 1-3 replicas     │                  │
    │  └──────────────────┘                └──────────────────┘                  │
    │                                                                            │
    │ Capacity decision: scale the stage whose queue grows under representative  │
    │ traffic, then verify that upstream and downstream stages remain healthy.   │
    └────────────────────────────────────────────────────────────────────────────┘

A worked allocation might begin with small CPU settings for normalization, more replicas for retrieval, fractional GPU for reranking, and full GPU for generation. This is not a universal recipe. It is a hypothesis that should be validated with load tests and dashboard metrics. The point is that Ray Serve lets you express different resource shapes inside one composed application.
    from ray import serve


    @serve.deployment(
        num_replicas=2,
        ray_actor_options={"num_cpus": 0.25},
    )
    class TicketNormalizer:
        def normalize(self, ticket: str) -> str:
            return " ".join(ticket.replace("\n", " ").lower().split())


    @serve.deployment(
        autoscaling_config={
            "min_replicas": 2,
            "max_replicas": 6,
            "target_num_ongoing_requests_per_replica": 8,
        },
        ray_actor_options={"num_cpus": 0.5},
    )
    class RunbookRetriever:
        def retrieve(self, ticket: str) -> list[str]:
            return [
                "Check recent deployment events.",
                "Compare failing namespace limits with requested resources.",
                "Inspect application logs around the first failing request.",
            ]


    @serve.deployment(
        num_replicas=2,
        ray_actor_options={"num_gpus": 0.25},
    )
    class Reranker:
        def rerank(self, ticket: str, docs: list[str]) -> list[str]:
            return sorted(docs, key=len, reverse=True)


    @serve.deployment(
        autoscaling_config={
            "min_replicas": 1,
            "max_replicas": 3,
            "target_num_ongoing_requests_per_replica": 2,
            "upscale_delay_s": 15,
            "downscale_delay_s": 120,
        },
        ray_actor_options={"num_gpus": 1},
    )
    class AnswerGenerator:
        def generate(self, ticket: str, docs: list[str]) -> str:
            joined = " ".join(docs)
            return f"Draft response for ticket '{ticket[:80]}': {joined}"

Now reason about the allocation. The retriever can scale wider because it is cheaper and has a higher concurrency target. The generator has a lower queue target because each request is expensive and user-visible latency rises quickly when generation queues grow. The reranker uses fractional GPU because a smaller model may not need an entire accelerator, but that choice must be verified under realistic traffic.
Active check: Suppose the dashboard shows generator replicas are healthy, GPU utilization is high, and num_ongoing_requests for the generator keeps rising. Which setting would you evaluate first? A strong answer is to increase max_replicas if the cluster has more GPU capacity, reduce per-request generation cost if it does not, or lower the queue target if latency is more important than throughput. Increasing retriever replicas would not address the bottleneck.
| Stage | Resource shape | Scaling signal | First tuning move |
|---|---|---|---|
| Normalizer | Small CPU | CPU saturation or request backlog | Add modest replicas only if it queues |
| Retriever | CPU plus network | Ongoing requests and external latency | Scale replicas and watch downstream pressure |
| Reranker | Fractional GPU | GPU utilization and method latency | Validate fractional packing under load |
| Generator | Full GPU | Queue depth, latency, GPU utilization | Tune max_replicas, batching, and output length |
| Pipeline router | Small CPU | HTTP errors and handler latency | Keep logic thin and avoid blocking calls |
The worked example also shows why application design and platform design must meet. If the generator accepts unlimited prompt length, no autoscaler can fully protect latency. If the retriever sometimes blocks on a slow external store, the pipeline needs timeouts and graceful error responses. Ray Serve gives the infrastructure tools, but the application must still behave like a production service.
Deploying Ray Serve with RayService
A RayService packages the Ray cluster configuration and the Serve application configuration into one Kubernetes resource. This is usually the right abstraction for a production endpoint that should survive restarts and be reconciled by the operator. The serveConfigV2 block describes the application name, route prefix, import path, runtime environment, and Serve deployment settings.
The example below assumes your application code is available to the Ray runtime. In production, that usually means building it into the image, using a versioned package, or pointing runtime_env at a controlled artifact. Copying files into a running pod is acceptable for a lab but not for repeatable production delivery.
    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: support-assistant
      namespace: ray-serving
    spec:
      serviceUnhealthySecondThreshold: 900
      deploymentUnhealthySecondThreshold: 300
      serveConfigV2: |
        applications:
          - name: support-assistant
            route_prefix: /
            import_path: support_app:app
            runtime_env:
              pip:
                - starlette
            deployments:
              - name: TicketNormalizer
                num_replicas: 2
                ray_actor_options:
                  num_cpus: 0.25
              - name: RunbookRetriever
                autoscaling_config:
                  min_replicas: 2
                  max_replicas: 6
                  target_num_ongoing_requests_per_replica: 8
                ray_actor_options:
                  num_cpus: 0.5
              - name: Reranker
                num_replicas: 2
                ray_actor_options:
                  num_gpus: 0.25
              - name: AnswerGenerator
                autoscaling_config:
                  min_replicas: 1
                  max_replicas: 3
                  target_num_ongoing_requests_per_replica: 2
                ray_actor_options:
                  num_gpus: 1
      rayClusterConfig:
        rayVersion: "2.9.0"
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
          template:
            spec:
              containers:
                - name: ray-head
                  image: rayproject/ray-ml:2.9.0-py310-gpu
                  ports:
                    - containerPort: 8265
                      name: dashboard
                    - containerPort: 8000
                      name: serve
                  resources:
                    requests:
                      cpu: "2"
                      memory: "8Gi"
                    limits:
                      cpu: "4"
                      memory: "16Gi"
        workerGroupSpecs:
          - groupName: gpu-workers
            replicas: 2
            minReplicas: 1
            maxReplicas: 4
            rayStartParams: {}
            template:
              spec:
                containers:
                  - name: ray-worker
                    image: rayproject/ray-ml:2.9.0-py310-gpu
                    resources:
                      requests:
                        cpu: "4"
                        memory: "16Gi"
                        nvidia.com/gpu: "1"
                      limits:
                        cpu: "8"
                        memory: "32Gi"
                        nvidia.com/gpu: "1"

The most important field in the Serve configuration is import_path. It tells Ray Serve what Python object to import and run. If the import path is wrong, Kubernetes may show healthy pods while Ray Serve has no healthy application. This is a common source of confusion because cluster health and application health are related but not identical.
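A wrong import path can be caught before the manifest reaches the cluster. The sketch below is a local pre-flight check, not Ray Serve's own loader: `resolve_import_path` is a hypothetical helper that handles only the module:attribute form used in this module, and it must run in an environment where the application module is importable.

```python
import importlib


def resolve_import_path(import_path: str):
    """Import 'module:attribute' and return the attribute, or raise."""
    module_name, separator, attr_name = import_path.partition(":")
    if not separator or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)  # ModuleNotFoundError if missing
    return getattr(module, attr_name)              # AttributeError if missing


if __name__ == "__main__":
    # Demonstrated with a standard-library module so the sketch runs anywhere.
    print(resolve_import_path("json:dumps").__name__)
```

Running the same check against support_app:app in the image build or CI step surfaces a typo as an ordinary Python exception instead of an unhealthy Serve application.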
Apply and inspect the resource through Kubernetes first. Then move inward to Ray-specific state. This outside-in debugging pattern keeps you from assuming Ray is broken when the manifest never reconciled, or assuming Kubernetes is broken when the Python import failed.
    k apply -f support-assistant-rayservice.yaml
    k get rayservices -n ray-serving
    k describe rayservice support-assistant -n ray-serving
    k get rayclusters -n ray-serving
    k get pods -n ray-serving

KubeRay generates the name of the backing RayCluster from the RayService name, so list the RayCluster resources first if you want to filter pods by the ray.io/cluster label.

After the pods are running, forward the Serve port and test the route. Use 127.0.0.1 for local testing so the command is explicit about the loopback address. In a real environment, you would expose the service through the platform’s ingress, gateway, or internal service mesh policy.

    k port-forward svc/support-assistant-head-svc 8000:8000 -n ray-serving

    curl http://127.0.0.1:8000 \
      -H "Content-Type: application/json" \
      -d '{"ticket": "Deployment failed after a new image rollout. Pods restart repeatedly."}'

If the request fails, read the error at the right layer. A connection failure suggests Service, port-forward, or pod readiness. A route not found suggests Serve routing or application import. A JSON parsing error suggests request shape or application code. A long delay suggests model initialization, queueing, or insufficient replicas.
Autoscaling, Batching, and GPU Allocation
Ray Serve autoscaling is based on ongoing requests per replica. This matters because online inference often queues before CPU or GPU utilization looks obviously saturated. A replica running a large model may be fully committed with only a few active requests. A lightweight preprocessor may handle many concurrent requests before becoming the bottleneck.
The main autoscaling setting is target_num_ongoing_requests_per_replica. A lower target protects latency by scaling sooner, but it can consume more resources. A higher target improves resource efficiency, but it can increase tail latency. There is no perfect default; the right value depends on model cost, request size, latency objective, and cold-start behavior.
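The trade-off can be read as arithmetic. The sketch below is a simplified model of the decision, not Ray Serve's actual autoscaler: it ignores the upscale and downscale delays and any smoothing, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(
    total_ongoing_requests: int,
    target_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replicas needed so each one stays at or below the target, clamped
    to the configured bounds (simplified model, no delays or smoothing)."""
    raw = math.ceil(total_ongoing_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Same load, two targets: the lower target asks for more replicas.
    print(desired_replicas(20, 4, 1, 5))
    print(desired_replicas(20, 8, 1, 5))
```

When the clamped value equals max_replicas while requests keep arriving, the extra load has nowhere to go except the queue, which is why the upper bound deserves as much attention as the target.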
    from ray import serve


    @serve.deployment(
        autoscaling_config={
            "min_replicas": 1,
            "max_replicas": 5,
            "target_num_ongoing_requests_per_replica": 4,
            "upscale_delay_s": 10,
            "downscale_delay_s": 90,
        },
        ray_actor_options={"num_cpus": 1},
    )
    class AutoscaledClassifier:
        def classify(self, text: str) -> dict:
            label = "urgent" if "outage" in text.lower() else "normal"
            return {"label": label}

Batching is a separate lever. Autoscaling adds replicas; batching makes each replica process multiple compatible requests together. Batching is useful when the model backend gains efficiency from larger batches, especially on GPUs. It is harmful when batching adds wait time without improving throughput, or when requests have highly variable sizes that cause one large request to delay many small ones.

    from ray import serve


    @serve.deployment(ray_actor_options={"num_cpus": 1})
    class BatchedEmbedder:
        @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.05)
        async def embed_batch(self, texts: list[str]) -> list[list[float]]:
            # Receives a list of inputs and must return one output per input.
            return [[float(len(text)), float(text.count(" "))] for text in texts]

        async def embed(self, text: str) -> list[float]:
            # Callers pass a single item; Serve groups concurrent calls into a batch.
            return await self.embed_batch(text)

Fractional GPU allocation is a Ray scheduling feature, not a magic memory isolation boundary. Setting num_gpus: 0.25 tells Ray’s scheduler that four such actors may be placed on one GPU. It does not guarantee that the models will fit in GPU memory or that their kernels will not interfere with each other. Always validate fractional packing with the real model, real batch sizes, and real traffic.
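A quick memory estimate shows why the scheduler's view and the device's view can disagree. The sketch is a back-of-the-envelope check with hypothetical numbers: `packing_report` is not a Ray function, the memory figures are invented for illustration, and real usage also depends on batch size, activations, and framework overhead.

```python
def packing_report(
    num_gpus_per_actor: float, model_mem_gib: float, gpu_mem_gib: float
) -> dict:
    """Compare what the scheduler would pack onto one GPU with what fits."""
    if not 0 < num_gpus_per_actor <= 1:
        raise ValueError("this sketch only covers fractions in (0, 1]")
    # How many actors the scheduler may place on a single GPU.
    actors_per_gpu = int(1 / num_gpus_per_actor + 1e-9)
    needed_gib = actors_per_gpu * model_mem_gib
    return {
        "actors_per_gpu": actors_per_gpu,
        "needed_gib": needed_gib,
        "fits": needed_gib <= gpu_mem_gib,
    }


if __name__ == "__main__":
    # Hypothetical 16 GiB GPU with 5 GiB and 3 GiB model footprints.
    print(packing_report(0.25, 5.0, 16.0))
    print(packing_report(0.25, 3.0, 16.0))
```

In the first case the scheduler would still place four actors on the device even though their combined footprint exceeds it, which is exactly the failure that load testing with the real model is meant to expose.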
| Setting | What it controls | Good use | Risk if misused |
|---|---|---|---|
| `num_replicas` | Fixed number of deployment replicas | Stable, predictable capacity | Overprovisioning or underprovisioning |
| `min_replicas` | Lower bound for autoscaling | Keep warm capacity available | Idle cost if set too high |
| `max_replicas` | Upper bound for autoscaling | Protect cluster capacity | Queue growth if set too low |
| `target_num_ongoing_requests_per_replica` | Queue target per replica | Balance latency and utilization | Tail latency or resource waste |
| `upscale_delay_s` | Delay before scale-up | Avoid reacting to tiny spikes | Slow response to sudden load |
| `downscale_delay_s` | Delay before scale-down | Avoid thrashing and cold starts | Longer idle resource usage |
| `ray_actor_options.num_gpus` | Ray GPU scheduling request | Match model accelerator needs | Pending replicas or GPU contention |
| `@serve.batch` | Per-replica request batching | Improve model throughput | Added wait time and uneven latency |
A reliable tuning process changes one lever at a time. Start with a fixed small replica count and observe baseline latency. Add autoscaling when you know which stage queues. Add batching only when the backend benefits from batches. Add fractional GPUs only after measuring memory and latency isolation. Production performance work is a measurement loop, not a search for a universal YAML snippet.
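The batching trade-off described in this section can be explored offline before touching a cluster. The following is a simplified simulation, not Ray Serve's implementation: it assumes a batch opens when its first request arrives and is dispatched either when it is full or when the wait timeout expires, and it ignores model execution time. The function name `simulate_batches` and the arrival times (in milliseconds) are hypothetical.

```python
def simulate_batches(arrivals, max_batch_size, batch_wait_timeout):
    """Group sorted arrival times into batches.

    Returns (batches, waits), where waits[i] is how long request i sat
    in the batching queue before its batch was dispatched.
    """
    batches, waits = [], []
    i = 0
    while i < len(arrivals):
        deadline = arrivals[i] + batch_wait_timeout
        batch = [arrivals[i]]
        j = i + 1
        # Keep adding requests that arrive before the deadline until full.
        while (j < len(arrivals) and len(batch) < max_batch_size
               and arrivals[j] <= deadline):
            batch.append(arrivals[j])
            j += 1
        # A full batch leaves when its last member arrives; otherwise at timeout.
        dispatch = batch[-1] if len(batch) == max_batch_size else deadline
        batches.append(batch)
        waits.extend(dispatch - t for t in batch)
        i = j
    return batches, waits


if __name__ == "__main__":
    arrivals_ms = [0, 10, 20, 200, 205]  # hypothetical bursty traffic
    for size in (8, 2):
        batches, waits = simulate_batches(arrivals_ms, size, batch_wait_timeout=50)
        print(f"max_batch_size={size}: batches={batches} max_wait_ms={max(waits)}")
```

In this toy trace the large batch size never fills, so every request pays part of the timeout. That is the pattern to look for in real measurements: batching that adds wait without producing fuller batches.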
Monitoring and Debugging Ray Serve
Monitoring Ray Serve requires three views. Kubernetes tells you whether pods and services exist. The Ray dashboard tells you whether Ray resources, actors, and Serve deployments are healthy. Application metrics and logs tell you whether model behavior is correct. Senior platform engineers move across all three instead of treating one dashboard as the whole truth.
Forward the dashboard when debugging a lab cluster. In production, access should go through your organization’s secure access path rather than an unauthenticated local tunnel. The dashboard is useful for seeing cluster resources, actor placement, Serve deployment health, and logs.
k port-forward svc/ray-demo-cluster-head-svc 8265:8265 -n ray-serving

Open http://127.0.0.1:8265 after the port-forward is active. Look for resource totals first. If Ray reports zero GPUs but Kubernetes scheduled GPU worker pods, the problem is likely inside the image, device plugin integration, or Ray startup environment. If Ray reports GPUs but Serve replicas are pending, compare each deployment’s num_gpus request with available resources.
Prometheus metrics provide the long-term operational view. Names can vary by Ray version and configuration, so always verify the metrics exposed in your environment. The useful categories remain stable: request latency, request count, replica health, ongoing requests, and cluster resource utilization.
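Percentile latency is usually estimated from histogram buckets rather than from raw samples. The sketch below shows the general idea behind a bucket-based quantile estimate, using linear interpolation inside the bucket that contains the target rank. It is a simplified illustration under stated assumptions, not Prometheus source code; the function name `estimate_quantile` and the bucket counts are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with an (inf, total) bucket. Requires 0 < q <= 1 and at least
    two buckets.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the last finite bound.
                return buckets[-2][0]
            # Interpolate linearly between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical latency buckets in milliseconds (cumulative counts).
    latency_buckets = [(100, 50), (250, 90), (500, 99), (math.inf, 100)]
    for q in (0.5, 0.95):
        print(q, round(estimate_quantile(q, latency_buckets), 1))
```

Because the estimate can only be as precise as the bucket boundaries, a p99 that always lands exactly on a boundary is often a sign that the buckets are too coarse for the latency objective.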
histogram_quantile(
  0.99,
  sum(rate(ray_serve_request_latency_ms_bucket[5m])) by (le, deployment)
)

sum(rate(ray_serve_num_requests_total[5m])) by (deployment)

sum(ray_serve_num_ongoing_requests) by (deployment)

avg(ray_node_gpu_utilization) by (node)

Use symptoms to choose the next command. If pods are not ready, stay with k get pods, k describe pod, and container logs. If pods are healthy but Serve is failing, inspect RayService conditions and Ray dashboard Serve state. If Serve is healthy but latency is high, inspect ongoing requests, replica counts, model logs, and downstream dependencies.
k get rayservices -n ray-serving
k describe rayservice support-assistant -n ray-serving
k logs -n ray-serving -l ray.io/node-type=head --tail=100
k logs -n ray-serving -l ray.io/node-type=worker --tail=100

| Symptom | Likely layer | First checks | Common fix |
|---|---|---|---|
| Worker pod is Pending | Kubernetes scheduling | k describe pod, quotas, node labels, GPU availability | Adjust requests, labels, tolerations, or capacity |
| Ray worker pod runs but does not join | KubeRay or Ray startup | Worker logs, head service DNS, Ray version mismatch | Fix image, networking, or Ray startup parameters |
| Serve route returns not found | Ray Serve routing | RayService status, import path, route prefix | Correct serveConfigV2 and import path |
| Replica stays unhealthy | Application or resource allocation | Replica logs, model load errors, actor resources | Fix dependencies, image, resource requests, or model path |
| Latency rises during traffic | Serve scaling or model bottleneck | Ongoing requests, latency metrics, GPU utilization | Tune replicas, batching, queue target, or model cost |
| GPU exists but actor is pending | Ray resource scheduling | Dashboard resources and num_gpus settings | Match actor GPU requests to available Ray resources |
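The triage table above can be encoded as a small lookup so that runbooks or alert annotations point at the same first checks. This is only a sketch: the short symptom keys and the function name `first_checks` are hypothetical, and the layer and check text is copied from the table rather than derived from any Ray or Kubernetes API.

```python
# Symptom key -> (likely layer, first checks), transcribed from the table.
TRIAGE = {
    "worker_pod_pending": (
        "Kubernetes scheduling",
        "k describe pod, quotas, node labels, GPU availability",
    ),
    "worker_not_joining": (
        "KubeRay or Ray startup",
        "worker logs, head service DNS, Ray version mismatch",
    ),
    "route_not_found": (
        "Ray Serve routing",
        "RayService status, import path, route prefix",
    ),
    "replica_unhealthy": (
        "Application or resource allocation",
        "replica logs, model load errors, actor resources",
    ),
    "latency_rising": (
        "Serve scaling or model bottleneck",
        "ongoing requests, latency metrics, GPU utilization",
    ),
    "gpu_actor_pending": (
        "Ray resource scheduling",
        "dashboard resources and num_gpus settings",
    ),
}


def first_checks(symptom: str) -> str:
    """Return 'layer: checks' for a known symptom key."""
    # Unknown symptoms fall back to the lowest layer and work upward.
    layer, checks = TRIAGE.get(
        symptom, ("unknown", "start with pod status, then RayService, then Serve")
    )
    return f"{layer}: {checks}"


if __name__ == "__main__":
    print(first_checks("route_not_found"))
    print(first_checks("something_else"))
```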
The most dangerous debugging shortcut is changing many settings at once. For example, increasing worker replicas, changing batch size, and modifying max_replicas in the same rollout makes it hard to know which change helped or hurt. Treat the serving system like any other distributed system: isolate the symptom, form a hypothesis, make one change, and verify the result.
Ray Serve Versus Alternatives
Ray Serve sits in a crowded model-serving landscape. Choosing it well requires comparing workload shape, not just feature lists. If your service is one optimized model with strict latency targets and a supported runtime format, Triton may be a strong fit. If your organization already built platform standards around KServe and Knative, operational consistency may matter more than Python-native composition.
Ray Serve becomes attractive when the serving application is a distributed Python program. RAG pipelines, multi-model document processing, agentic workflows, and training-to-serving workflows often need more than a single model invocation. Ray Serve lets those stages live in one application graph while still scaling independently. That can reduce glue code and make pipeline-level behavior easier to reason about.
| Feature | Ray Serve | Triton | Seldon Core | KServe |
|---|---|---|---|---|
| Primary strength | Python-native distributed serving | Optimized inference runtime | Enterprise model serving patterns | Kubernetes-native model serving |
| Model composition | Python handles and deployment graphs | Ensembles and backend configuration | Graph-style inference services | Inference services and pipelines |
| Best workload shape | Multi-stage AI applications | Low-latency supported model formats | Governed enterprise serving | Standardized K8s model endpoints |
| Scaling model | Serve replicas on Ray resources | Usually external orchestration | Kubernetes and platform integrations | Kubernetes and Knative patterns |
| GPU sharing approach | Ray fractional resource scheduling | Runtime and deployment dependent | Platform dependent | Platform dependent |
| Developer workflow | Python classes and imports | Model repository and config | CRDs and model server patterns | CRDs and model server patterns |
| Operational trade-off | Adds Ray runtime to operate | Runtime tuning can be specialized | More platform components | Knative and KServe complexity |
A good decision document should include a rejected-options section. For Ray Serve, justify why the distributed Python graph is worth operating Ray. For Triton, justify why model runtime optimization is more important than Python-native orchestration. For KServe, justify why platform standardization and Kubernetes-native workflows outweigh Ray’s composition model. Tool choice becomes stronger when it names the trade-off.
Active check: Your team serves one ONNX model with strict p99 latency, no Python pipeline, and a platform team already experienced with GPU runtime tuning. Would Ray Serve be your default recommendation? A careful answer is probably no. Ray Serve can serve the model, but Triton or another specialized runtime may be a better first evaluation target because the workload does not need Ray’s composition strengths.
Did You Know?
- Ray started as a distributed systems project at UC Berkeley’s RISELab, which helps explain why its core abstractions focus on distributed tasks, actors, and scheduling.
- Ray Serve applications can compose deployments with Python handles, so internal stages do not need to communicate through separate HTTP services.
- Fractional GPU settings in Ray are scheduling requests, so teams must still validate GPU memory pressure and latency behavior under real model load.
- KubeRay provides Kubernetes custom resources such as `RayCluster`, `RayJob`, and `RayService` so Ray workloads can be reconciled through Kubernetes control loops.
Common Mistakes
| Mistake | Problem | Better approach |
|---|---|---|
| Treating Ray Serve like a single-process web framework | The team misses actor placement, resource scheduling, and distributed failure modes | Debug by layer: Kubernetes, KubeRay, Ray runtime, Serve deployment, application code |
| Putting expensive model replicas on the head pod | Control-plane coordination can compete with inference work | Keep the head focused on coordination and use worker groups for model capacity |
| Asking Ray for more GPUs than Kubernetes gives the pods | Serve replicas remain pending even though pods are running | Match Kubernetes GPU limits with ray_actor_options.num_gpus and dashboard resources |
| Using fractional GPUs without load testing | Multiple actors may fit by scheduler accounting but fail on memory or latency | Validate fractional packing with real models, batch sizes, and traffic patterns |
| Scaling every stage equally | Cheap stages overprovision while expensive stages keep queueing | Scale the stage whose ongoing-request queue and latency indicate bottleneck behavior |
| Copying code into pods for production delivery | Rollouts become manual and unreproducible | Package application code in an image, wheel, or controlled runtime artifact |
| Ignoring import-path failures in RayService | Kubernetes pods appear healthy while Serve has no usable application | Check RayService conditions, Serve logs, and the Python import path together |
| Changing autoscaling, batching, and resources in one rollout | The team cannot identify which change affected latency or cost | Tune one lever at a time and compare metrics before and after each change |
Question 1
Your team deploys a RayService for a RAG endpoint. Kubernetes shows the head and worker pods as Running, but HTTP requests return a route-not-found response. What should you check first, and why?
Show Answer
Start with the RayService status, Serve application state, route prefix, and import_path. Running pods prove the Ray cluster exists, but they do not prove the Serve application imported correctly or registered the expected route. This is a Ray Serve application-layer symptom, not a pod scheduling symptom.
Question 2
A generator deployment uses ray_actor_options: {"num_gpus": 1} and autoscaling allows four replicas. The Ray dashboard reports only two GPUs in the cluster, and two replicas remain pending during load. What design correction would you recommend?
Show Answer
Either increase available GPU worker capacity or reduce the generator’s maximum replica count to match the real GPU budget. The pending replicas are a resource scheduling mismatch: Ray cannot place four one-GPU actors on a cluster that exposes only two GPU resources.
Question 3
A pipeline has a fast normalizer, a moderate retriever, and a slow generator. Latency rises during traffic, and metrics show generator ongoing requests increasing while normalizer replicas are mostly idle. What should you change first?
Show Answer
Tune the generator stage first by evaluating max_replicas, queue target, batching, output length, or model cost. Scaling the normalizer will not reduce the queue forming at the generator. The bottleneck should drive the first capacity change.
Question 4
Your team wants to use num_gpus: 0.25 for four small reranker replicas on one GPU. The service starts, but p99 latency becomes unstable under realistic traffic. What does this reveal about fractional GPU allocation?
Show Answer
Fractional GPU allocation is scheduler accounting, not a guarantee of predictable performance isolation. The models may fit on the device but still compete for memory bandwidth, compute, or kernel execution. The team should test fewer colocated replicas, adjust batch behavior, or allocate larger GPU fractions.
Question 5
A platform team already runs KServe for simple single-model endpoints. A product team proposes Ray Serve for a new document workflow that includes OCR, retrieval, reranking, and generation with different resource needs. How would you evaluate the proposal?
Show Answer
Compare workload shape against operational cost. Ray Serve is a strong candidate because the workflow is a multi-stage Python application with heterogeneous resources and independent scaling needs. The decision should still account for the extra Ray runtime, platform support model, and whether existing KServe standards can meet the same requirements.
Question 6
A learner increases target_num_ongoing_requests_per_replica from two to twelve on an expensive LLM deployment because they want better GPU utilization. Throughput improves slightly, but user-facing latency becomes unacceptable. What trade-off did they expose?
Show Answer
They raised the per-replica queue target, which can improve utilization but allows more requests to wait behind expensive inference work. For latency-sensitive generation, the team may need a lower queue target, more replicas, shorter outputs, batching changes, or a different model size.
Question 7
After a Ray image upgrade, the RayCluster reconciles but Serve replicas fail during startup with dependency import errors. Kubernetes resource settings did not change. What is the most likely category of failure, and what evidence should you gather?
Show Answer
This is likely an application runtime or image dependency failure. Gather Serve replica logs, Python import errors, runtime_env details, and image contents. Kubernetes scheduling succeeded, so changing pod resources is unlikely to fix missing packages or incompatible model dependencies.
Question 8
A team copies rag_app.py into the Ray head pod during a demo and starts serve run manually. The demo works, but the endpoint disappears after a pod restart. What production change aligns the deployment with Kubernetes operations?
Show Answer
Package the application code into a versioned image, wheel, or controlled runtime artifact and deploy it through RayService with a stable import_path. Manual file copies and interactive commands are not reconciled desired state, so they vanish when pods restart.
Hands-On Exercise: Build and Diagnose a Ray Serve RAG Pipeline
Objective: Deploy a small Ray Serve RAG-style pipeline on Kubernetes, verify the request path, and diagnose one intentional scaling question using the same layer-by-layer method used in production.
Scenario
Your platform team is evaluating Ray Serve for an internal support assistant. The first version does not call a real LLM because the goal is to validate serving architecture before spending GPU budget. The pipeline normalizes a support question, retrieves a matching context string, and generates a templated answer. After it works, you will inspect where autoscaling would matter if the generator became expensive.
Step 1: Create the Namespace and Confirm the Operator
Create a namespace for the lab. If the KubeRay operator is already installed, do not reinstall it. If it is missing, install it with Helm before continuing.
k create namespace ray-lab
k get pods -n ray-system
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

helm install kuberay-operator kuberay/kuberay-operator \
  --namespace ray-system \
  --create-namespace

Step 2: Deploy a CPU Ray Cluster
Save the following manifest as ray-lab-cluster.yaml. This cluster uses CPU workers so the exercise can run in more environments. The same structure later supports GPU worker groups when you are ready to test accelerator-backed deployments.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-lab-cluster
  namespace: ray-lab
spec:
  rayVersion: "2.9.0"
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0-py310
            ports:
              - containerPort: 6379
                name: gcs
              - containerPort: 8265
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            resources:
              requests:
                cpu: "1"
                memory: "2Gi"
              limits:
                cpu: "2"
                memory: "4Gi"
  workerGroupSpecs:
    - groupName: cpu-workers
      replicas: 2
      minReplicas: 1
      maxReplicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0-py310
              resources:
                requests:
                  cpu: "1"
                  memory: "2Gi"
                limits:
                  cpu: "2"
                  memory: "4Gi"

Apply it and wait for the pods. If the wait command times out, inspect pod events before changing the manifest.
k apply -f ray-lab-cluster.yaml
k get rayclusters -n ray-lab
k get pods -n ray-lab -l ray.io/cluster=ray-lab-cluster
k wait --for=condition=ready pod -l ray.io/cluster=ray-lab-cluster -n ray-lab --timeout=300s

Step 3: Create the RAG Application
Save the following file as rag_app.py. The code uses three deployments and one composed pipeline. Notice that the resource settings differ by stage even though the model logic is small.
from ray import serve
from starlette.requests import Request

KNOWLEDGE_BASE = {
    "kubernetes": "Kubernetes manages desired state for containerized workloads.",
    "ray": "Ray distributes Python tasks and actors across a cluster.",
    "serve": "Ray Serve routes HTTP requests to distributed deployment replicas.",
    "gpu": "GPU scheduling requires both Kubernetes device allocation and Ray resource accounting.",
}


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.25})
class Normalizer:
    def normalize(self, query: str) -> str:
        return " ".join(query.lower().strip().split())


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 0.5})
class Retriever:
    def retrieve(self, query: str) -> str:
        # First key found in the query wins, in dictionary order.
        for key, value in KNOWLEDGE_BASE.items():
            if key in query:
                return value
        return "No matching context found."


@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 3,
        "target_num_ongoing_requests_per_replica": 4,
    },
    ray_actor_options={"num_cpus": 0.5},
)
class Generator:
    def generate(self, query: str, context: str) -> str:
        return f"Using context '{context}', answer the question: {query}"


@serve.deployment
class RagPipeline:
    def __init__(self, normalizer, retriever, generator):
        self.normalizer = normalizer
        self.retriever = retriever
        self.generator = generator

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        query = body.get("query", "")

        # Each stage is called through its deployment handle.
        clean_query = await self.normalizer.normalize.remote(query)
        context = await self.retriever.retrieve.remote(clean_query)
        answer = await self.generator.generate.remote(clean_query, context)

        return {
            "query": query,
            "normalized_query": clean_query,
            "context": context,
            "answer": answer,
        }


normalizer = Normalizer.bind()
retriever = Retriever.bind()
generator = Generator.bind()
app = RagPipeline.bind(normalizer, retriever, generator)

Step 4: Run the Application on the Cluster
For a lab, copy the file into the head pod and run Serve from there. This is not the production delivery method, but it keeps the exercise focused on Ray Serve behavior. Production delivery should package code into an image or artifact and reconcile it through RayService.
HEAD_POD=$(k get pod -n ray-lab -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}')
k cp rag_app.py "ray-lab/${HEAD_POD}:/tmp/rag_app.py"
k exec -n ray-lab "${HEAD_POD}" -- bash -lc "cd /tmp && serve run rag_app:app --host 0.0.0.0 --port 8000"

If your terminal remains attached, open a second terminal for the next commands. If the command exits, read the error and classify it. Import errors belong to the Python environment. Route errors belong to Serve configuration. Pod errors belong to Kubernetes or Ray startup.
Step 5: Send Requests Through the Serve Endpoint
Forward the Serve port and call the endpoint. Try at least two queries that match different knowledge-base keys, and optionally a third that matches none. Different context values in the responses help confirm that the request reached the retriever and generator rather than returning a static response.
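Before sending requests, it can help to predict what the retriever should return. The sketch below mirrors the `Normalizer` and `Retriever` logic from rag_app.py as plain functions so it runs without Ray; the function names and the third query are hypothetical additions for illustration. Note that the retriever returns the first key it finds in dictionary order, so a query that mentions both "ray" and "serve" resolves to the "ray" entry.

```python
# Plain-Python mirror of the lab's knowledge base and stage logic (no Ray needed).
KNOWLEDGE_BASE = {
    "kubernetes": "Kubernetes manages desired state for containerized workloads.",
    "ray": "Ray distributes Python tasks and actors across a cluster.",
    "serve": "Ray Serve routes HTTP requests to distributed deployment replicas.",
    "gpu": "GPU scheduling requires both Kubernetes device allocation and Ray resource accounting.",
}


def normalize(query: str) -> str:
    # Lowercase and collapse whitespace, as the Normalizer deployment does.
    return " ".join(query.lower().strip().split())


def retrieve(query: str) -> str:
    # First matching key wins, in dictionary insertion order.
    for key, value in KNOWLEDGE_BASE.items():
        if key in query:
            return value
    return "No matching context found."


if __name__ == "__main__":
    queries = [
        "How does Ray Serve route requests?",
        "What should I check when GPU scheduling fails?",
        "Where is the cafeteria?",  # hypothetical non-matching query
    ]
    for q in queries:
        print(f"{q!r} -> {retrieve(normalize(q))}")
```

If the live endpoint returns a different context than this dry run predicts, the difference points at the deployed code or request body rather than at Kubernetes or Ray scheduling.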
k port-forward svc/ray-lab-cluster-head-svc 8000:8000 -n ray-lab

curl http://127.0.0.1:8000 \
  -H "Content-Type: application/json" \
  -d '{"query": "How does Ray Serve route requests?"}'

curl http://127.0.0.1:8000 \
  -H "Content-Type: application/json" \
  -d '{"query": "What should I check when GPU scheduling fails?"}'

Step 6: Inspect the Dashboard and Reason About Scaling
Forward the dashboard and inspect the Serve application. Look for the three deployments, their replica counts, and the cluster resources. Then answer the diagnostic question before changing any settings: if the generator became a real LLM call and started queueing, which deployment would you tune first?
k port-forward svc/ray-lab-cluster-head-svc 8265:8265 -n ray-lab

Open http://127.0.0.1:8265 and inspect the Serve page. The correct reasoning is that the generator should be tuned first if it is the stage whose queue grows. You might increase max_replicas, lower target_num_ongoing_requests_per_replica, add batching if the backend benefits, reduce output length, or allocate GPU resources in a GPU-enabled cluster.
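Before raising max_replicas, it is worth checking that the cluster could actually place the extra replicas. The sketch below does that arithmetic for this lab. It rests on assumptions that should be verified in the dashboard: that each worker pod contributes CPUs equal to its container CPU limit, and that `RagPipeline` reserves one CPU per replica because it sets no `ray_actor_options`. The function name `cpu_demand` and the raised replica count are hypothetical.

```python
def cpu_demand(deployments) -> float:
    """Sum CPUs requested when every deployment is at its maximum replicas."""
    return sum(d["max_replicas"] * d["num_cpus"] for d in deployments)


# Values taken from rag_app.py; the RagPipeline CPU figure is an assumed default.
LAB_DEPLOYMENTS = [
    {"name": "Normalizer", "max_replicas": 2, "num_cpus": 0.25},
    {"name": "Retriever", "max_replicas": 2, "num_cpus": 0.5},
    {"name": "Generator", "max_replicas": 3, "num_cpus": 0.5},
    {"name": "RagPipeline", "max_replicas": 1, "num_cpus": 1.0},
]

# Assumption: two worker pods, each exposing its 2-CPU limit to Ray.
WORKER_CPUS = 2 * 2


def with_generator_max(deployments, new_max: int):
    """Return a copy of the plan with a different Generator max_replicas."""
    return [
        {**d, "max_replicas": new_max} if d["name"] == "Generator" else d
        for d in deployments
    ]


if __name__ == "__main__":
    current = cpu_demand(LAB_DEPLOYMENTS)
    raised = cpu_demand(with_generator_max(LAB_DEPLOYMENTS, 8))
    print(f"current plan: {current} CPUs of {WORKER_CPUS} on workers")
    print(f"generator max_replicas=8: {raised} CPUs of {WORKER_CPUS} on workers")
```

When the raised plan exceeds what the workers expose, extra generator replicas would stay pending until the worker group scales, which is the same mismatch the GPU quiz question describes.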
Step 7: Convert the Lab Learning into a Production Plan
Write a short production plan for the same pipeline. Include how code would be packaged, which stages would use CPU or GPU, what metrics would drive autoscaling, and what failure layer each alert would point to. This written plan is part of the exercise because senior platform work is not only applying manifests; it is explaining why the chosen operational boundary is maintainable.
Success Criteria
- A `ray-lab` namespace exists and contains a running Ray head pod and worker pods.
- The RAG application code defines at least three Serve deployments and composes them with `.bind()`.
- The endpoint returns different context for at least two different query inputs.
- The Ray dashboard shows the Serve application and deployment replicas.
- You can explain why a generator queue should be fixed at the generator stage before scaling unrelated stages.
- You can identify whether a failure belongs first to Kubernetes, KubeRay, Ray runtime, Ray Serve, or application code.
- You wrote a production plan that replaces manual `k cp` delivery with a versioned image, package, or controlled runtime artifact.
Cleanup
Delete the lab namespace when finished. This removes the Ray cluster, worker pods, and any objects created inside the namespace.
k delete namespace ray-lab

Next Module
Continue to Module 9.6: LangChain & LlamaIndex to learn about building LLM applications with frameworks for RAG, agents, and chains.