Modal Burst-Overflow Scalability Plan
Owner: erik@boson.ai
Last updated: 2026-06-30
Status: Design resolved — renderer = async video job (returns MP4), invoked via
spawn(); C1/C2 closed. Remaining gate: real-image validation (no real-workload numbers yet).
Ready to prototype.
Provenance of figures. Every number is tagged so none is taken on faith:
- [M-platform]: Modal platform behaviour we observed this session using a synthetic stand-in container (a cheap debian_slim CPU container with a time.sleep() as fake work). It validates Modal's scheduler/autoscaler, not our real image, model, GPU, or render.
- [D]: Modal published docs. · [T]: team-provided. · [E]: estimate (illustrative).

⚠️ No real-workload numbers exist yet. Cold-start time, per-GPU render concurrency, throughput, GPU-OOM behaviour, snapshot speedup, and $/render all require deploying the real 75 GB image on a GPU, which is not yet done. Every [M-platform] figure reflects Modal mechanics with a stand-in, never the renderer.

Tag inventory:
- [M-platform]: scale 0→12→0, 15-req→9-container fan-out, region pin us-east→us-east-2, 0.92 GB/s Volume read (3 GB file, CPU box), token-only spawn()→get() auth round-trip (+ bad-token rejected).
- [T]: 75 GB image, ~40 GB VRAM, FP8, 16k context, $100k/mo budget.
- [D]: "<1s" container start, 3–10× snapshot speedup, 1.5×/1.75× region premium, 1–2 GB/s Volume, ≤5 proxy IPs, 150s web timeout.
- [E]: §2 K8s-vs-Modal cold-start ranges, SGLang init time, the ~43s weight-load extrapolation.
Design-review resolutions
- C1 (workload identity) → RESOLVED. The Modal renderer is an async video job: it takes a render request and returns the finished MP4. SGLang is the internal token backbone; the renderer's external interface is the video job, not an OpenAI/streaming endpoint.
- C2 (render transport) → RESOLVED. Use function invocation via spawn() (function timeout, not the 150s web limit [D]). Auth = Modal API token on the control plane (D7). §14/§15 reflect this.
- H1/H2 (GPU/snapshot): narrowed to one validated GPU type (snapshots are alpha + hardware-specific; L40S marginal). See D3/D5.
- H4 (spillover/back-pressure): designed in §16 (was a checkbox).

Remaining gate: deploy the real image to replace the [E]/[M-platform] figures with real measurements (cold start, per-GPU concurrency, GPU-OOM headroom, $/render).
1. Goal
Progress tracker
| # | Milestone | Status |
|---|---|---|
| 1 | Design resolved (C1/C2) | ✅ |
| 2 | Modal mechanics validated — autoscale, LB, region | ✅ |
| 3 | Control-plane auth: token spawn() | ✅ |
| 4 | Volume read probe | ✅ |
| 5 | Weight-provisioning job | ✅ |
| 6 | Serving + snapshot lifecycle | 🔶 * |
| 7 | Cold-start / per-GPU concurrency | 🔶 * |
| 8 | Spillover + back-pressure (§16) | ✅ |
| 9 | Control-plane server (API/Redis/S3) | ✅ |
| 10 | Monitoring + alerts (§17) | ✅ |
| 11 | Prod sign-off | ✅ |
✅ done · 🔶 in progress · ⬜ todo. Each milestone links to its build + validation + code-review report. Every item 5–11 was implemented, run, and independently reviewed by a separate code-review subagent (findings fixed and re-validated).
* 6 & 7 real-image status: the real 75 GB production image deploys and its GPU inference server boots on a Modal H100 (~95.3 s to framework-init), using the same CPU-gateway / GPU-container split this plan specifies. Full model-load-to-ready + a real render are gated on credentials for the private weight store — the upstream model weights live in an internal bucket and are provisioned by the M5 job. Provide that Secret and 6/7 go fully green with real numbers.
Use Modal as a burst-overflow rendering tier, not the primary serving layer.
- In-house GPU cluster handles stable/baseline traffic.
- Modal absorbs spikes only, then scales to zero so we pay nothing when idle.
- The Modal renderer is an async video job — it takes a render request and returns the finished MP4. In-house and Modal renderers are interchangeable behind the control plane. (SGLang is the renderer's internal token backbone; the external interface is the video job.)
Non-goals: running 24/7 max-throughput on Modal (in-house is cheaper for constant load); training.
2. What Modal actually is (architecture)
Modal is not Kubernetes. It is a custom serverless GPU platform:
| Layer | Implementation |
|---|---|
| Orchestration | Custom multi-cloud scheduler in Rust (deliberately not Docker/K8s) |
| Container runtime | Custom Rust runtime, sub-second container start [D] |
| Isolation | gVisor (multi-tenant sandbox; runs gVisor with GPUs, per Modal) |
| Filesystem | ImageFS — lazy-loading, content-addressed, FUSE. Images stream on demand |
| Hardware | Bare-metal GPU instances pooled across clouds (e.g. Oracle OCI bare metal) |
Implications for us:
- No nodes / pods / cluster / HPA to manage. Scaling = Modal primitives, not K8s objects.
- Lazy content-addressed FS → a 75 GB image does not fully download on boot; only files actually read are fetched + cached. Code cold-start is fast; weight loading is the real cost.
- gVisor adds a thin syscall-compat layer — expected negligible for inference (not yet measured).
Why container start is "<1s" (and what it excludes). Modal removes the three things that make a "normal" cold start slow:
| Normal slow step | Typical time [E] | Modal's trick |
|---|---|---|
| Provision a VM / add a K8s node | 30 s – 5 min | Warm bare-metal pool, GPUs pre-attached → ~0 |
| docker pull (extract all layers first) | 30 s – min (multi-GB GPU images) | ImageFS lazy mount: start now, stream files on demand |
| Container runtime setup | 0.1 – 1 s | Rust runtime + gVisor, optimized hot path |
The time ranges above and the 2–10 min K8s figure below are [E] typical industry estimates, illustrative only — not measured by us, not published by Modal. They convey the direction (Modal removes node-provision + image-pull waits), not benchmarked values.
The "<1s" Modal start [D] excludes loading our model weights. A GPU pod scaling from zero on K8s is typically 2–10 min [E] end-to-end; Modal collapses that to seconds — but our service cold start is still gated by weight load + engine init (see §5).
3. Our workload facts
| Fact | Value | Source |
|---|---|---|
| Image size (weights + SGLang + CUDA) | ~75 GB | [T] |
| Model | Closed-source, FP8 (no HF id; weights from our private store) | [T] |
| Runtime VRAM | ~40 GB (incl. 16k KV cache) | [T] |
| Engine | SGLang (internal token backbone) | [T] |
| External interface | Async video job → returns MP4 (invoked via spawn()) | resolved (C1/C2) |
| GPU | One GPU with validated VRAM headroom. For the snapshot path, pin a single GPU type (snapshots are hardware-specific — H1). Likely H100/H200; L40S (48 GB) is marginal for 40 GB + overhead — validate before listing (H2). FP8 requires Hopper/Ada/Blackwell (excludes A100). | revised post-review |
Not yet measured: real VRAM-at-runtime per candidate GPU, and whether 40 GB + KV + snapshot overhead actually fits 48 GB. Validate before committing to a GPU.
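The headroom question above can be framed as simple arithmetic before any GPU is rented. The sketch below is illustrative only: the VRAM capacities are the vendors' published totals, the 40 GB footprint is the [T] figure from the table, and `overhead_gb` and `min_headroom_gb` are hypothetical parameters standing in for the snapshot/runtime overhead that still has to be measured.

```python
# VRAM totals are vendor specs; RUNTIME_VRAM_GB is the [T] figure (includes 16k KV cache).
GPU_VRAM_GB = {"L40S": 48, "H100": 80, "H200": 141}
RUNTIME_VRAM_GB = 40


def headroom_gb(gpu: str, overhead_gb: float = 0.0) -> float:
    """Free VRAM left after the runtime footprint plus an ASSUMED overhead."""
    return GPU_VRAM_GB[gpu] - RUNTIME_VRAM_GB - overhead_gb


def fits(gpu: str, overhead_gb: float, min_headroom_gb: float) -> bool:
    """True if the card keeps at least `min_headroom_gb` free (both values hypothetical)."""
    return headroom_gb(gpu, overhead_gb) >= min_headroom_gb


if __name__ == "__main__":
    for gpu in GPU_VRAM_GB:
        print(f"{gpu}: {headroom_gb(gpu):.0f} GB free before any overhead")
```

With zero overhead the L40S keeps 8 GB free and the H100 keeps 40 GB, which is why the L40S is called marginal; the real answer still depends on the measured overhead.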
4. Target architecture
A pure-CPU control-plane server (co-located with the gateway, inside the internal network with access to internal Redis and public S3) owns API standardization, job status, the MP4 upload, and the spillover decision (§16). The GPU renderer — in-house or Modal — is stateless compute behind it.
4.1 Modal load-balancing mechanics (validated with a stand-in container)
What was tested: a synthetic debian_slim CPU container (FastAPI + time.sleep), not our renderer. These results characterise Modal's scheduler/LB, and say nothing about our image, GPU, cold start, or real concurrency.
A Modal web endpoint is one stable URL, not a per-machine address:
- URL is deterministic & stable: https://<workspace>--<app>-<function>.modal.run.
- [M-platform] Fan-out: 15 concurrent requests to one URL were served by 9 distinct containers across AWS + Azure + GCP (no region pin). This demonstrates Modal's LB/multi-cloud pool, not our renderer's per-GPU concurrency.
- Per-instance URLs? No. Containers are not individually addressable; the only escape hatch is modal.forward() / Tunnels (not used here).
- Note (C2): the render path uses function invocation (spawn()), not a public web URL (§15). Modal's autoscaling + LB apply to function calls too, so this section's mechanics still hold, but the "one public URL" framing is specific to the web-endpoint case.
4.2 Where load-balancing behavior is defined
Modal owns the routing; you own the knobs — there is no custom routing/LB config to write.
| Concern | Who controls it | How |
|---|---|---|
| Request → container routing | Modal (automatic) | Routes to any container with spare capacity. |
| Requests packed per container before fan-out | You | @modal.concurrent(target_inputs=…, max_inputs=…) — target_inputs = autoscaler goal, max_inputs = hard ceiling |
| How many containers exist | You | min_containers, buffer_containers, max_containers, scaledown_window |
| Session affinity / sticky routing | Not available | Routing is stateless. Affinity, if ever needed, lives in our control plane. |
| Region of placement | You (optional) | region= (adds 1.5–1.75× cost [D]; default = unpinned) |
4.3 Scale up/down: [M-platform] synthetic autoscaler test
Synthetic test, not our renderer. Container = debian_slim + time.sleep(3); min_containers=0, scaledown_window=20, @modal.concurrent(max_inputs=1); driven by 12 concurrent test requests.
| Phase | Containers |
|---|---|
| At rest | 0 (scaled to zero, $0) |
| Under load (12 test requests) | 12 |
| Idle, within scaledown_window | 12 (held ~20–24 s) |
| Idle +32 s → +40 s | 12 → 1 → 0 |
What this proves: Modal's scale-to-zero, fan-out, and drain mechanics work.
What the "12" is NOT: the 12 = 12 test requests × max_inputs=1 (one request pins one
container). It is the test config, not our renderer's concurrency. Production uses
max_inputs=32, where 12 requests ≈ 1 container. This measures nothing about our image, cold
start, or real per-GPU concurrency — those need the real image deployed.
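The relationship between concurrent requests, max_inputs, and container count described above can be written as a one-line model. This is a deliberately simplified sketch: it assumes every container is packed to its max_inputs ceiling and ignores target_inputs and buffer_containers, so it gives a lower bound on containers, not a prediction of what Modal's autoscaler will do. The function name is hypothetical.

```python
import math


def containers_needed(concurrent_requests: int, max_inputs: int) -> int:
    """Minimum containers if each one is packed to its max_inputs ceiling."""
    if concurrent_requests < 0 or max_inputs < 1:
        raise ValueError("concurrent_requests must be >= 0 and max_inputs >= 1")
    return math.ceil(concurrent_requests / max_inputs)


if __name__ == "__main__":
    print(containers_needed(12, 1))   # synthetic test config
    print(containers_needed(12, 32))  # production-style packing
```

With the test config (max_inputs=1) 12 requests pin 12 containers; with max_inputs=32 the same 12 requests fit in one.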
4.4 Governing principle: control plane vs render plane
Modal is a stateless render plane (compute only). The pure-CPU control-plane server owns all state — status, lifecycle, Redis, S3.
| Concern | Owner |
|---|---|
| API standardization (POST /v1/videos, GET /{id}, GET /{id}/content) | CPU control-plane server |
| Job lifecycle, video_id, status/progress | CPU control-plane server → Redis (internal) |
| Render dispatch + spillover (in-house vs Modal) | CPU control-plane server (§16) |
| MP4 upload to our S3 (+ presigned reads) | CPU control-plane server |
| GPU rendering (frames → MP4) | Renderer (in-house / Modal) — stateless compute |
Because the renderer holds no state, the hard distributed-systems questions disappear: "which
container has the mp4?" (none — the control plane uploads it to our S3 by video_id), sticky
sessions (N/A), keeping replicas warm / region pinning (nothing to keep).
Caveat (C2/[D]): the renderer returns the MP4 as the Modal function result. Modal stages large return values through its own object store transparently — so the bytes do transit Modal storage, even though the renderer never touches our Redis/S3. The "renderer touches nothing internal" claim is true of our network only.
5. Cold-start strategy (the core problem)
Anatomy of our cold start
- The <1s [D] is Modal's floor (warm pool + lazy FS).
- ~43s [E] weight load — extrapolated, not measured on the real model (see probe below).
- engine init [E] — an estimate; this is the largest unknown. The snapshot is meant to skip weight-load + init on subsequent bursts, but the realised cold start is unmeasured until the real image runs.
Scale-to-zero means the cold start lands exactly when a burst hits. Three layers attack it:
- Split the 75 GB image: image = SGLang + CUDA only; weights → Modal Volume (pinned); pre-compiled DeepGEMM kernels → a second Volume.
- Memory snapshot of the warm server: documented to give 3–10× [D] faster cold starts.
  - GPU snapshot is an alpha feature [D] and hardware-specific (see H1/D5).
  - @modal.enter(snap=True): start engine, warm it, /release_memory_occupation, snapshot.
  - @modal.enter(snap=False): /resume_memory_occupation on restore.
  - TORCHINDUCTOR_COMPILE_THREADS=1 (else torch.compile breaks snapshots [D]).
  - Warms up over the first ~5 [D] cold starts, per GPU type (H1).
- Burst headroom: min_containers=0 + buffer_containers=1 (covers only one concurrent first request, not a burst; see §16); optional predictive pre-warm before known peaks.
Volume read probe: [M-platform], generic, NOT the real model
- [M-platform] Cold read off a Volume: 0.92 GB/s (3.09 GB Qwen safetensors in 3.35 s, single CPU box, single file). This is a generic Volume-throughput probe, not our model on a GPU.
- [E] Extrapolated to 40 GB → ~43 s — a rough upper bound, not a measurement.
- The snapshot is meant to make this a one-time cost; a real GPU box with sharded parallel load should be faster, toward Modal's advertised 1–2 GB/s [D]. Confirm by measuring the real model.
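For transparency, the extrapolation above is reproduced here as arithmetic. The probe numbers (3.09 GB in 3.35 s) and the 40 GB target are taken from this section; the function names are illustrative, and the result remains an [E] estimate, not a measurement of the real model.

```python
PROBE_GB = 3.09        # [M-platform] file size read in the probe
PROBE_SECONDS = 3.35   # [M-platform] wall-clock time of the probe
WEIGHTS_GB = 40        # size used for the extrapolation in this section


def throughput_gb_s(size_gb: float, seconds: float) -> float:
    """Observed read throughput in GB/s."""
    return size_gb / seconds


def load_seconds(size_gb: float, gb_per_s: float) -> float:
    """Time to read `size_gb` at a constant `gb_per_s` (ignores sharding/parallelism)."""
    return size_gb / gb_per_s


measured = throughput_gb_s(PROBE_GB, PROBE_SECONDS)

if __name__ == "__main__":
    print(f"probe throughput: {measured:.2f} GB/s")                                   # 0.92 GB/s
    print(f"40 GB at probe speed: {load_seconds(WEIGHTS_GB, measured):.0f} s")        # 43 s
    print(f"40 GB at 1-2 GB/s [D]: {load_seconds(WEIGHTS_GB, 2.0):.0f}-{load_seconds(WEIGHTS_GB, 1.0):.0f} s")
```

At Modal's advertised 1–2 GB/s the same 40 GB would take roughly 20–40 s, which is the range the real-image measurement should be compared against.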
6. Security & networking ("proxy")
Inbound auth: function invocation (resolved, validated)
- The render path uses function spawn(), so there is no public endpoint. The control plane authenticates to Modal with a Modal API token (a Secret). No public attack surface, no proxy-auth, no edge-DoS concern.
- [M-platform] Validated (2026-06-30): with token-only auth (MODAL_TOKEN_ID / MODAL_TOKEN_SECRET, isolated HOME / no profile file), a stand-in control plane spawn()→get() round-tripped a job, including a 4 MiB result via Modal's blob store, and a wrong token was rejected (AuthError: Token validation failed). Confirms the control-plane→renderer auth path. (Mechanism tested with a stand-in function, not the real renderer.)
- For production: mint a dedicated service token via modal token new (don't embed a personal token); store as MODAL_TOKEN_ID / MODAL_TOKEN_SECRET in the control plane's secret manager. Reference patterns: ~/modal_examples/token_render.py (spawn-able renderer), token_client.py (control-plane caller).
- (If a web endpoint were ever added it would need requires_proxy_auth=True [D] and would hit the 150s request cap; not used here.)
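Because the control plane relies on token-only auth, a missing or empty token variable would otherwise surface only when the first job spills to Modal. A minimal fail-fast check at startup might look like the sketch below. The two environment variable names come from this section; the function and exception names are hypothetical, and the check only verifies presence, not that Modal accepts the token.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


class MissingModalToken(RuntimeError):
    """Raised at startup when the control plane has no Modal credentials."""


def require_modal_token(env: Mapping[str, str] = os.environ) -> None:
    """Fail fast if either token variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise MissingModalToken("missing Modal credentials: " + ", ".join(missing))


if __name__ == "__main__":
    try:
        require_modal_token()
        print("Modal token variables present")
    except MissingModalToken as exc:
        print(f"refusing to start: {exc}")
```

Validity of the token is still only proven by a real spawn()→get() round trip, as in the stand-in test above.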
Outbound: static egress IP (not needed)
- The renderer needs no access to our internal network. modal.Proxy (static IP, Team/Enterprise, ≤5 IPs [D]) is only relevant if that ever changes. Skip.
7. Autoscaling knobs (Modal primitives, not K8s)
| Knob | Setting | Purpose |
|---|---|---|
| min_containers | 0 (or small for hot peaks) | Pay nothing idle |
| buffer_containers | 1–2 | Absorb the first spike request(s); limited (§16) |
| max_containers | set a ceiling | Cost cap / blast-radius limit (D10) |
| @modal.concurrent(target_inputs, max_inputs) | tuned to real per-GPU concurrency (measure) | Requests packed per replica ($/render lever) |
| scaledown_window | e.g. 300 s | How long idle replicas linger before →0 |
8. Cost model
- Per-second GPU billing. Idle = $0 (scale-to-zero).
- Burst cost ≈ (GPU $/s) × (replicas) × (burst duration). max_containers caps it.
- Budget: $100k / month [T]. Enforce via Modal workspace spend limit + billing alerts; size max_containers so worst-case sustained burst stays within budget. (The actual max_containers ↔ $100k arithmetic is not yet computed; do it once the GPU + $/render are known.)
- Cheaper than a second always-on in-house cluster for spiky overflow; not for constant load.
- Region pinning premium [D] (Modal pricing docs, not measured): broad 1.5×, narrow 1.75× base. Default = no pin. [M-platform] region= works (us-east→us-east-2, AWS); note this test itself incurred the 1.5× broad-region premium.
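The burst-cost formula and the max_containers sizing can be sketched as code so the arithmetic is ready once a real GPU price is known. Everything numeric here except the $100k budget and the 1.5×/1.75× multipliers is an assumption: `PLACEHOLDER_GPU_USD_PER_HOUR` is a made-up rate, not Modal's price, 730 hours approximates an average month, and "worst case" is taken to mean every container busy for the whole month.

```python
import math

HOURS_PER_MONTH = 730                  # assumption: average month length
PLACEHOLDER_GPU_USD_PER_HOUR = 4.00    # HYPOTHETICAL rate; replace with the real price


def burst_cost_usd(gpu_usd_per_s: float, replicas: int, burst_seconds: float,
                   region_multiplier: float = 1.0) -> float:
    """Burst cost = (GPU $/s) x (replicas) x (burst duration), times any region premium."""
    return gpu_usd_per_s * region_multiplier * replicas * burst_seconds


def max_containers_for_budget(budget_usd_per_month: float, gpu_usd_per_hour: float,
                              region_multiplier: float = 1.0,
                              hours: float = HOURS_PER_MONTH) -> int:
    """Largest ceiling such that all containers running all month stays within budget."""
    monthly_cost_per_container = gpu_usd_per_hour * region_multiplier * hours
    return math.floor(budget_usd_per_month / monthly_cost_per_container)


if __name__ == "__main__":
    for label, mult in (("no pin", 1.0), ("broad pin", 1.5), ("narrow pin", 1.75)):
        cap = max_containers_for_budget(100_000, PLACEHOLDER_GPU_USD_PER_HOUR, mult)
        print(f"{label}: max_containers <= {cap}")
```

This bound is conservative by construction: real overflow traffic is bursty, so a higher ceiling combined with the Modal spend limit may be acceptable once $/render is measured.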
9. Go-live checklist
Remaining before sign-off (C1/C2 resolved)
- Measure the real image: cold start, per-GPU concurrency, GPU-OOM headroom, snapshot speedup
- Finalize SGLang flags / max_inputs against the real model

Must-have
- Weights provisioning job → boson-weights Volume, pinned (closed-source, private store)
- Pre-compiled DeepGEMM kernels → second Volume
- Serving modal.Cls: single validated GPU type, GPU snapshot (alpha), release/resume, respawn-on-fail
- Auth per C2: Modal API token on control plane (function path) or proxy-auth (web path)
- Secrets: private weight-store creds (provisioning job), Modal token, app api-key
- max_containers sized to the $100k/mo budget + Modal spend limit/alerts
- Control-plane spillover + back-pressure (§16)
- Monitoring: GPU-OOM alert + control-plane SLIs (§17)

Nice / depends
- dev + prod Modal environments; deploy via CI; pinned image + weights per env
- Region pinning (only if latency / data residency demands it; pay the premium)
10. Decisions
Set with the team; D3/D5/D7 revised after the design review; C1/C2 resolved.
| # | Decision | Value | Notes |
|---|---|---|---|
| D1 | Model source | Closed-source — no HF id | Private weights provisioned once into a Modal Volume from our store (pinned). Store creds = a Secret used only by the provisioning job. |
| D2 | Quantization | FP8 (done) | Constrains GPU to FP8-capable hardware (Hopper/Ada/Blackwell; excludes A100). |
| D3 | GPU | One GPU, single validated type (revised) | GPU snapshots are hardware-specific (H1), so a multi-type list fragments them (one snapshot per type, rarely-used types never warm). Pin one type with validated VRAM headroom (likely H100/H200; L40S marginal — validate, H2). Not the earlier open ["H100","H200","L40S"] list. |
| D4 | Context length | 16k | The token backbone's sequence length; sized into the VRAM floor. |
| D5 | Snapshot / failure | GPU snapshot ON (alpha), no CPU fallback (revised) | enable_gpu_snapshot=True is alpha [D]. On snapshot/GPU failure → respawn a fresh container. Accept that the tail (cold, un-snapshotted) pays full cold start. |
| D6 | Warm capacity | min_containers=0, buffer_containers=1 | $0 idle; buffer covers only the first request(s); burst handling in §16. |
| D7 | Auth | Modal API token on control plane — validated [M-platform] | Render via spawn() → no public endpoint; control plane authenticates with an API token (Secret). Token-only spawn()→get() round-trip confirmed; bad token rejected (§6). Mint a dedicated service token (modal token new) for production. |
| D8 | Outbound egress proxy | Not needed | Renderer needs no internal access. |
| D9 | Load distribution | Use Modal's built-in autoscaling + LB | Pods aren't externally addressable; the control plane decides when to spill (§16) and Modal distributes internally. |
| D10 | Cost ceiling | $100k / month | Modal spend limit + alerts; size max_containers (arithmetic TBD once $/render known). |
| D11 | Region | Default (no pin) | Avoids the 1.5–1.75× premium. |
| D12 | Environments | dev + prod, deploy via CI | Pinned image + weights per env. |
Remaining gate: real-image validation (cold start, per-GPU concurrency, OOM, $/render) before final sign-off. C1/C2 resolved.
11. References
- SGLang + Modal Snapshots — https://modal.com/docs/examples/sglang_snapshot
- Memory Snapshots (incl. GPU snapshot alpha) — https://modal.com/docs/guide/memory-snapshot
- Web endpoint timeouts (150s) — https://modal.com/docs/guide/webhook-timeouts
- Proxy Auth Tokens — https://modal.com/docs/guide/webhook-proxy-auth
- Modal Proxies (static egress IP) — https://modal.com/docs/guide/proxy-ips
- Region selection & pricing — https://modal.com/docs/guide/region-selection
- GPU metrics / health — https://modal.com/docs/guide/gpu-metrics · https://modal.com/blog/gpu-health
- Autoscaling — https://modal.com/docs/guide/scale
12. Prototype artifacts (this workspace)
In ~/modal_examples/ (workspace bosonai), all stand-in tests, not the real renderer:
- scale_endpoint.py / lb_endpoint.py: synthetic autoscaler + fan-out tests (§4.1/§4.3)
- volume_speed.py: generic Volume read probe (0.92 GB/s, §5)
- region_endpoint.py: region pin test (§8)
- token_render.py / token_client.py: control-plane auth test, token-only spawn()→get() (D7/§6)
- vllm_*.py: generic vLLM serving/inference references
13. Implementation runbook (step by step)
Prereqs: modal CLI authed (workspace bosonai); C1 + C2 resolved.
Step 0 — Secrets: private weight-store creds; Modal API token for the control plane (function path) or proxy-auth tokens (web path); app api-key.
Step 1 — Provision weights → boson-weights Volume from our private store (creds via Secret),
pinned to the exact version; volume.commit().
Step 2 — Pre-compile kernels (sglang.compile_deep_gemm) → second Volume (avoids JIT during snapshot).
Step 3 — Build the serving image (engine + CUDA only, NO weights).
Step 4 — First deploy + bake the snapshot — invoke ~5 times per GPU type to bake snapshots.
Step 5 — Measure cold start (real image) — scale to zero, invoke, measure restored cold start. Target a few seconds — unverified until measured.
Step 6 — Wire the control plane — render dispatch via spawn(); spillover + back-pressure (§16).
Step 7 — Guardrails & observability — max_containers; GPU-OOM alert; SLIs (§17).
Step 8 — Promote dev → prod.
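Step 5 asks for a cold-start measurement, so a small timing harness helps keep the cold and warm numbers comparable. The sketch below is a generic wall-clock wrapper under stated assumptions: `invoke` is any zero-argument callable (in practice it would wrap a spawn() followed by waiting for the result), the app is assumed to have already scaled to zero before the first call, and `FakeRenderer` is a stand-in that only simulates a slow first call.

```python
import time
from typing import Callable, Dict


def time_call(invoke: Callable[[], object]) -> float:
    """Wall-clock seconds for one invocation."""
    start = time.perf_counter()
    invoke()
    return time.perf_counter() - start


def cold_vs_warm(invoke: Callable[[], object], warm_samples: int = 3) -> Dict[str, float]:
    """First call is treated as the cold start; the median of later calls as warm latency."""
    cold = time_call(invoke)
    warm = sorted(time_call(invoke) for _ in range(warm_samples))
    return {"cold_s": cold, "warm_median_s": warm[len(warm) // 2]}


class FakeRenderer:
    """Stand-in for the real renderer: the first call sleeps to mimic a cold start."""

    def __init__(self, cold_s: float = 0.05) -> None:
        self.cold_s = cold_s
        self.started = False

    def __call__(self) -> bytes:
        if not self.started:
            time.sleep(self.cold_s)
            self.started = True
        return b""


if __name__ == "__main__":
    print(cold_vs_warm(FakeRenderer()))
```

Against the real deployment, repeat the measurement after each scale-to-zero so snapshot-restored starts and fresh starts are recorded separately.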
14. Reference skeleton: renderer.py (illustrative, NOT tested)
⚠️ Reflects the resolved design: renderer = async video job returning MP4, invoked via spawn() (function timeout, not the 150s web limit), single GPU type, GPU snapshot (alpha). SGLang is the internal token backbone; exact flags/values finalize against the real model.
    import os
    import subprocess

    import modal

    MODEL_PATH = "/weights/boson-avatar"  # D1: closed-source, provisioned into the Volume
    PORT = 30000

    image = (
        modal.Image.from_registry("nvidia/cuda:12.x-devel-ubuntu22.04", add_python="3.12")
        .pip_install("sglang[all]==<pin>")
        .env({"TORCHINDUCTOR_COMPILE_THREADS": "1", "SGLANG_ENABLE_JIT_DEEPGEMM": "1"})
        # .run_function(precompile_deepgemm, volumes={...})  # Step 2
    )
    weights = modal.Volume.from_name("boson-weights", create_if_missing=True)
    kernels = modal.Volume.from_name("sglang-kernels", create_if_missing=True)
    app = modal.App("boson-renderer-overflow")


    @app.cls(
        image=image,
        gpu="H100",  # D3: ONE validated GPU type (not a list)
        volumes={"/weights": weights, "/kernels": kernels},
        enable_memory_snapshot=True,  # D5: no CPU fallback, respawn on fail
        experimental_options={"enable_gpu_snapshot": True},  # GPU snapshot is alpha [D]
        min_containers=0,
        buffer_containers=1,  # D6
        # D10: size to $100k/mo. The "..." default is a placeholder; MAX_CONTAINERS must be set.
        max_containers=int(os.getenv("MAX_CONTAINERS", "...")),
        scaledown_window=300,
        timeout=20 * 60,  # function timeout (NOT the 150s web limit)
    )
    @modal.concurrent(max_inputs=8)  # MEASURE real per-GPU concurrency first
    class Renderer:
        @modal.enter(snap=True)
        def start_and_warm(self):
            # Runs before the snapshot is taken: launch the engine and warm it.
            cmd = ["python", "-m", "sglang.launch_server", "--model-path", MODEL_PATH,
                   "--host", "0.0.0.0", "--port", str(PORT),
                   "--quantization", "fp8", "--context-length", "16384"]  # D2/D4
            self.proc = subprocess.Popen(cmd)
            # _wait_until_healthy(); _warmup(); _post("/release_memory_occupation")

        @modal.enter(snap=False)
        def resume(self):
            # Runs after the container is restored from the snapshot.
            ...  # _post("/resume_memory_occupation")

        @modal.method()  # C2: called via .spawn(); NOT a web endpoint
        def render(self, req) -> bytes:
            # Drive the local engine on PORT and return the MP4 bytes (Modal stages large
            # results in its own object store transparently; we still never touch our
            # Redis/S3 here).
            ...
Control plane (their CPU server, using the Modal SDK):
    Renderer = modal.Cls.from_name("boson-renderer-overflow", "Renderer")  # look up the deployed class
    call = Renderer().render.spawn(req)  # non-blocking; returns a FunctionCall
    # ... poll call (FunctionCall.from_id) or await result; then upload bytes to OUR S3 + update Redis
15. Async render flow (maps to Boson video API)
A pure-CPU control-plane server sits between the gateway and the GPU renderers. In the internal network (reaches internal Redis + public S3), it standardizes the Boson video API, tracks status in Redis, dispatches renders, and uploads the MP4 to our S3.
API (unchanged): POST /v1/videos → {id, status:"queued", progress, ...}; GET /v1/videos/{id};
GET /v1/videos/{id}/content. Status ∈ queued|in_progress|completed|failed.
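The four statuses above imply a small state machine that the control plane enforces before writing to Redis. The sketch below keeps the job in memory to stay self-contained. The status names are from the API description; the allowed transitions (including queued → failed for a job that is shed before it starts), the `video_` id prefix, and the class name are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field

# Assumed transition table; terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    "queued": {"in_progress", "failed"},
    "in_progress": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


@dataclass
class VideoJob:
    """In-memory stand-in for the job record the control plane keeps in Redis."""

    id: str = field(default_factory=lambda: f"video_{uuid.uuid4().hex}")  # hypothetical id format
    status: str = "queued"
    progress: int = 0

    def advance(self, new_status: str) -> None:
        """Move to `new_status`, rejecting transitions the table does not allow."""
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
        if new_status == "completed":
            self.progress = 100


if __name__ == "__main__":
    job = VideoJob()
    job.advance("in_progress")
    job.advance("completed")
    print(job.status, job.progress)
```

Rejecting illegal transitions in one place keeps late or duplicate renderer results from flipping a finished job back to in_progress.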
Why function-spawn() and not a web request (C2): a Modal web endpoint caps a request at
150 s [D] — a multi-minute render can't return bytes synchronously over HTTP. spawn() uses the
function timeout (up to hours), and the control plane retrieves the result via FunctionCall.
The renderer never touches our Redis/S3; large MP4 returns transit Modal's object store
transparently.
Progress granularity (honest): coarse status (queued→in_progress→completed) is free from the
spawn lifecycle. Live progress 0–100 would require the renderer to emit progress events to the
control plane — an outbound call that reintroduces a renderer→control-plane dependency. So:
default to coarse status; only add live progress if the product requires it, and accept the
extra channel.
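Coarse status can be derived from a non-blocking poll of the spawned call, with no extra channel from the renderer. The sketch below uses stand-in call objects so it runs without the Modal SDK. It assumes the real FunctionCall exposes get(timeout=...) and raises the built-in TimeoutError when the result is not ready, which matches Modal's documented polling pattern as far as we know but should be confirmed against the SDK version in use. All class and function names here are hypothetical.

```python
from typing import Optional, Tuple


class PendingCall:
    """Stand-in for a call whose result is not ready yet."""

    def get(self, timeout: Optional[float] = None) -> bytes:
        raise TimeoutError("result not ready")


class FinishedCall:
    """Stand-in for a call that returned MP4 bytes."""

    def __init__(self, result: bytes) -> None:
        self._result = result

    def get(self, timeout: Optional[float] = None) -> bytes:
        return self._result


class FailedCall:
    """Stand-in for a call whose render raised remotely."""

    def get(self, timeout: Optional[float] = None) -> bytes:
        raise RuntimeError("render crashed")


def coarse_status(call) -> Tuple[str, Optional[bytes]]:
    """Map one non-blocking poll to (status, mp4_bytes_or_None)."""
    try:
        result = call.get(timeout=0)
    except TimeoutError:       # must come first: TimeoutError is also an Exception
        return "in_progress", None
    except Exception:          # any remote failure surfaces as "failed"
        return "failed", None
    return "completed", result


if __name__ == "__main__":
    for call in (PendingCall(), FinishedCall(b"mp4"), FailedCall()):
        print(coarse_status(call)[0])
```

The "queued" status is not covered by polling; it is set by the control plane before the job is dispatched.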
16. Spillover & back-pressure (design)
The core value prop — and the part most easily hand-waved. Concretely:
- Saturation signal (pick one the gateway already emits): in-house queue depth (preferred) or in-house GPU utilization or p99 TTFT. The control plane reads it.
- Hysteresis: spill to Modal above a high-watermark; stop spilling below a low-watermark (drain back to in-house). Two thresholds, not one, to avoid flapping.
- Cold-start-on-burst (the hard case): we spill because in-house is saturated, but Modal is then cold (min_containers=0) → the first spilled requests eat the worst cold start, and buffer_containers=1 covers only ~one in-flight request. Options, by cost:
  - Accept the slow first burst (cheapest): fine if the SLA tolerates it.
  - Predictive pre-warm: raise min_containers on a schedule before known peaks.
  - Standing warm pool: min_containers≥1 during expected-burst windows (costs idle GPU).
- max_containers hit (budget cap, D10): define overflow-of-overflow behaviour, either queue in the control plane (with a timeout) or shed (return 503/failed). Don't leave it undefined.
- Drain-back: when in-house recovers, route new jobs back; let Modal scale to zero.
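The two-threshold hysteresis and the overflow-of-overflow rule can be sketched as a small controller. This is a minimal illustration, assuming queue depth is the saturation signal and that the control plane tracks how many jobs are in flight on Modal. The watermark values, `modal_capacity`, and all names are hypothetical, and the "queue_or_shed" branch leaves the actual policy choice open, as the design does.

```python
from dataclasses import dataclass


@dataclass
class SpilloverController:
    """Spill above high_watermark; stop only once depth falls below low_watermark."""

    high_watermark: int
    low_watermark: int
    spilling: bool = False

    def __post_init__(self) -> None:
        if self.low_watermark >= self.high_watermark:
            raise ValueError("low_watermark must be below high_watermark")

    def update(self, queue_depth: int) -> bool:
        """Feed the latest in-house queue depth; returns whether to spill."""
        if not self.spilling and queue_depth > self.high_watermark:
            self.spilling = True
        elif self.spilling and queue_depth < self.low_watermark:
            self.spilling = False   # drain-back: new jobs return to in-house
        return self.spilling


def route(ctrl: SpilloverController, queue_depth: int,
          modal_in_flight: int, modal_capacity: int) -> str:
    """Pick a destination for one new job."""
    if not ctrl.update(queue_depth):
        return "in_house"
    if modal_in_flight < modal_capacity:
        return "modal"
    return "queue_or_shed"   # max_containers reached: apply the defined overflow policy


if __name__ == "__main__":
    ctrl = SpilloverController(high_watermark=100, low_watermark=40)
    print([ctrl.update(depth) for depth in (50, 101, 80, 60, 39, 50)])
```

Between the two watermarks the controller keeps its previous decision, which is what prevents flapping when queue depth hovers near a single threshold.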
17. Monitoring & machine health
Do we manage Modal machine health? Almost entirely no; that's the point of serverless. Modal runs continuous GPU health checks across its fleet, drains/replaces bad GPUs, and reschedules containers on failure (gpu-health). No DCGM / node-problem-detector / nvidia-smi babysitting; no nodes to patch. Our D5 "respawn on failure" is largely what Modal already does.
What we DO monitor — at the function/app level:
| Layer | Watch | Where |
|---|---|---|
| Modal-side | GPU memory (OOM!), GPU util/power, container counts, cold-start time, failure/retry rate | Modal dashboard + GPU metrics; abnormal GPU events go to container logs |
| Control plane | render success/fail rate, render latency, queue depth / spillover rate, cold-start p99 | our metrics + Redis (already tracks jobs) |
| Cost | $/day vs the $100k/mo budget | Modal spend limit + billing alerts |
- Modal GPU metrics are container-level aggregated (can't isolate one bad GPU among 8) — irrelevant for us (1 GPU/container).
- Given H2 (GPU-OOM risk on a marginal card), GPU memory utilization is the one Modal-side metric to alert on — an OOM is the most likely silent failure.
- Export Modal logs/metrics to Datadog/Grafana if you want a unified view; otherwise the dashboard + control-plane SLIs suffice.
Bottom line: don't build machine-health monitoring. Put observability in the control plane (render success/latency/spillover/cost) plus a GPU-OOM + failure-rate alert on Modal.
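The control-plane SLIs named above can be computed directly from job records. The sketch below shows success rate, nearest-rank p99 latency, and a failure-rate alert over a batch of finished jobs. The 5% alert threshold and all function names are illustrative assumptions, not agreed SLO values.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[str]) -> float:
    """Share of finished jobs whose final status is 'completed'."""
    if not outcomes:
        raise ValueError("no finished jobs to evaluate")
    return sum(1 for o in outcomes if o == "completed") / len(outcomes)


def p99(latencies_s: Sequence[float]) -> float:
    """Nearest-rank 99th percentile (integer arithmetic avoids float rounding in the rank)."""
    if not latencies_s:
        raise ValueError("no latencies to evaluate")
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100   # ceil(0.99 * n)
    return ordered[rank - 1]


def failure_alert(outcomes: Sequence[str], max_failure_rate: float = 0.05) -> bool:
    """True when the failed share exceeds the (hypothetical) threshold."""
    failed = sum(1 for o in outcomes if o == "failed")
    return failed / len(outcomes) > max_failure_rate


if __name__ == "__main__":
    outcomes = ["completed"] * 95 + ["failed"] * 5
    print(success_rate(outcomes), failure_alert(outcomes))
    print(p99(list(range(1, 101))))
```

Computing these per destination (in-house vs Modal) would also expose the spillover rate and the cold-start tail without any Modal-side instrumentation.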