Challenge catalog

Kubernetes LLM challenges with evidence-based checks.

Start with free guided labs. Each challenge has objectives, commands, expected signals, paste-output validation, progressive hints, and a final readiness check.

Challenge of the Week

vLLM Inference Challenge

Deploy a GPU-backed OpenAI-compatible endpoint and prove scheduling, health, TTFT, queueing, and rollback readiness.

Hard | 75 min | Model serving | AI infrastructure engineer



Operator lab index


vLLM Inference Challenge

Deploy a GPU-backed OpenAI-compatible endpoint and prove scheduling, health, TTFT, queueing, and rollback readiness.

Topic: Model serving
Path: vllm-production-serving / production-readiness-ai
Difficulty: Hard
Time: 75 min
Persona: AI infrastructure engineer
Tools: kubectl + vLLM + Prometheus
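As a rough illustration of the TTFT (time to first token) signal this challenge asks you to prove, the sketch below times the first chunk of any streamed response. The `measure_ttft` helper and its `open_stream` parameter are hypothetical names, not part of vLLM or the lab; it assumes `open_stream` sends the request and returns an iterable of chunks.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(
    open_stream: Callable[[], Iterable[object]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, chunk_count) for one streamed response.

    open_stream is a hypothetical zero-argument callable that sends the
    request and returns an iterable of response chunks.
    """
    start = clock()
    ttft = None
    chunks = 0
    for _ in open_stream():
        if ttft is None:
            ttft = clock() - start  # time until the first chunk arrived
        chunks += 1
    total = clock() - start  # time until the stream was exhausted
    return ttft, total, chunks
```

In practice `open_stream` would wrap a streaming HTTP call to the endpoint. Injecting `clock` keeps the helper testable without a live server.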

RAG Retrieval Challenge

Operate ingestion, metadata filters, vector retrieval, answer evaluation, and failure drills for production RAG.

Topic: RAG
Path: rag-platform-engineering
Difficulty: Medium
Time: 60 min
Persona: MLOps engineer
Tools: kubectl + curl + vector database

Production Readiness Challenge

Run a launch review across security, quota, rollout, observability, cost, and ownership before live traffic.

Topic: Production
Path: production-readiness-ai / kubernetes-llm-foundations
Difficulty: Hard
Time: 50 min
Persona: Platform lead
Tools: kubectl + policy engine + dashboard

LLM Observability Challenge

Build the signal model needed to debug user latency, runtime saturation, GPU pressure, traces, logs, and alerts.

Topic: Observability
Path: llm-observability-cost / production-readiness-ai
Difficulty: Medium
Time: 45 min
Persona: SRE
Tools: Prometheus + Grafana + OpenTelemetry

vLLM Kubernetes Deployment Lab

Design the deployment contract for vLLM with model cache, readiness, runtime flags, and service exposure.

Topic: Model serving
Path: vllm-production-serving
Difficulty: Medium
Time: 55 min
Persona: AI infrastructure engineer
Tools: kubectl + vLLM + container registry

KServe vs Ray Serve Decision Lab

Choose the serving abstraction by ownership model, CRDs, graph complexity, autoscaling, and rollout needs.

Topic: Architecture
Path: kubernetes-llm-foundations / vllm-production-serving
Difficulty: Medium
Time: 35 min
Persona: Platform architect
Tools: decision matrix + runtime inventory

GPU Node Pool Scheduling Lab

Prove accelerator placement with labels, taints, tolerations, quotas, and unschedulable-pod debugging.

Topic: GPU capacity
Path: kubernetes-llm-foundations / vllm-production-serving
Difficulty: Hard
Time: 65 min
Persona: Platform engineer
Tools: kubectl + NVIDIA device plugin + cluster autoscaler

RAG Retrieval Quality Lab

Measure retrieval recall, citation accuracy, tenant filtering, and reranking latency before generation.

Topic: RAG
Path: rag-platform-engineering / llm-observability-cost
Difficulty: Hard
Time: 70 min
Persona: MLOps engineer
Tools: evaluation set + vector database + reranker
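To make "retrieval recall" concrete, here is a minimal sketch of recall@k over a labeled evaluation set. The function names and the data shape (a ranked list of document IDs paired with a set of relevant IDs) are assumptions for illustration and are not taken from the lab itself.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0  # assumption: queries with no labeled relevant docs score 0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(
    evaluation_set: Iterable[Tuple[Sequence[str], Set[str]]], k: int
) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    scores: List[float] = [recall_at_k(r, rel, k) for r, rel in evaluation_set]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `recall_at_k(["a", "b", "c"], {"a", "d"}, k=2)` finds one of the two relevant documents in the top two results and returns 0.5.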

Inference Cost Model Lab

Calculate cost per request from input tokens, output tokens, GPU profile, utilization, and cache behavior.

Topic: Cost
Path: llm-observability-cost / production-readiness-ai
Difficulty: Medium
Time: 45 min
Persona: AI platform lead
Tools: spreadsheet + metrics export + benchmark report
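One way to combine the inputs this lab names into a single figure is sketched below. The formula is a simplifying assumption, not the lab's official model: it treats prefill and decode as sequential GPU time, lets cached input tokens skip prefill entirely, and spreads idle GPU cost across busy seconds. `GpuProfile` and all of its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuProfile:
    hourly_usd: float            # hypothetical price per GPU-hour
    prefill_tokens_per_s: float  # assumed input-token throughput
    decode_tokens_per_s: float   # assumed output-token throughput


def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    gpu: GpuProfile,
    utilization: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Estimate USD per request under the simplified model described above."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if not 0 <= cache_hit_rate <= 1:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    # Cached input tokens are assumed to cost no prefill time.
    uncached_input = input_tokens * (1 - cache_hit_rate)
    gpu_seconds = (
        uncached_input / gpu.prefill_tokens_per_s
        + output_tokens / gpu.decode_tokens_per_s
    )
    # Idle capacity is paid for too, so divide by utilization.
    usd_per_busy_second = gpu.hourly_usd / 3600 / utilization
    return gpu_seconds * usd_per_busy_second
```

With a profile of 3.60 USD per hour, 10,000 prefill tokens/s, and 100 decode tokens/s, a request with 2,000 input tokens, 200 output tokens, 50% utilization, and a 50% cache hit rate comes to about 0.0042 USD under this model.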

LLM Rollout and Rollback Lab

Design traffic shifting, readiness gates, rollback triggers, and model-version ownership for inference services.

Topic: Production
Path: production-readiness-ai / vllm-production-serving
Difficulty: Hard
Time: 60 min
Persona: SRE
Tools: Argo CD + gateway policy + metrics dashboard

Multi-Tenant LLM Security Lab

Review tenant routing, namespace boundaries, secrets, NetworkPolicy, prompt logging, and retrieval authorization.

Topic: Security
Path: production-readiness-ai / rag-platform-engineering
Difficulty: Hard
Time: 70 min
Persona: Security-minded platform engineer
Tools: kubectl + NetworkPolicy + admission policy

LLM Observability and Cost Dashboard Lab

Create a dashboard model that joins user latency, queue wait, GPU pressure, token throughput, and cost signals.

Topic: Observability
Path: llm-observability-cost
Difficulty: Medium
Time: 50 min
Persona: SRE
Tools: Prometheus + Grafana + OpenTelemetry