Modal Burst-Overflow Scalability Plan
Owner: erik@boson.ai · Last updated: 2026-06-16 · Status: Implementation-ready (all §10 decisions locked)
Provenance of figures. Numbers are tagged by source so none are taken on faith: [M] measured by us on Modal this session · [D] Modal published docs · [T] team-provided · [E] estimate (illustrative, not measured; flagged inline).
- [M]: 0.92 GB/s volume read, 0→12→0 autoscale, 15-request→9-container fan-out, region pin us-east→us-east-2.
- [T]: 75 GB image, ~40 GB VRAM, FP8, 16k context, $100k/mo budget.
- [D]: "<1s" start, 3–10× snapshot speedup, 1.5×/1.75× region premium, 1–2 GB/s volume, ≤5 proxy IPs.
- [E]: the §2 K8s-vs-Modal cold-start ranges and the SGLang init time (directional only).
1. Goal
Use Modal as a burst-overflow inference tier, not the primary serving layer.
- In-house GPU cluster handles stable/baseline traffic.
- Modal absorbs spikes only, then scales to zero so we pay nothing when idle.
- Model is API-compatible (OpenAI schema via SGLang) so our gateway can treat in-house and Modal as interchangeable backends.
Non-goals: running 24/7 max-throughput on Modal (in-house is cheaper for constant load); training (this doc is inference-serving only).
2. What Modal actually is (architecture)
Modal is not Kubernetes. It is a custom serverless GPU platform:
| Layer | Implementation |
|---|---|
| Orchestration | Custom multi-cloud scheduler in Rust (deliberately not Docker/K8s) |
| Container runtime | Custom Rust runtime, sub-second container start |
| Isolation | gVisor (multi-tenant sandbox; first to run gVisor w/ GPUs) |
| Filesystem | ImageFS — lazy-loading, content-addressed, FUSE. Images stream on demand |
| Hardware | Bare-metal GPU instances pooled across clouds (e.g. Oracle OCI bare metal) |
Implications for us:
- No nodes / pods / cluster / HPA to manage. Scaling = Modal primitives, not K8s objects.
- Lazy content-addressed FS → a 75 GB image does not fully download on boot; only files actually read are fetched + cached. Code cold-start is fast; weight loading is the real cost.
- gVisor adds a thin syscall-compat layer — negligible for SGLang inference.
Why container start is "<1s" (and what it excludes). The <1s is the runtime bringing up the container process on an already-warm bare-metal machine. Modal hits it by deleting the three things that make a "normal" cold start slow:
| Normal slow step | Typical time [E] | Modal's trick |
|---|---|---|
| Provision a VM / add a K8s node | 30 s – 5 min | Warm bare-metal pool, GPUs pre-attached → ~0 |
| docker pull (extract all layers first) | 30 s – minutes (multi-GB GPU images) | ImageFS lazy mount: start now, stream files on demand |
| Container runtime setup | 0.1 – 1 s | Rust runtime + gVisor, optimized hot path |
The time ranges above and the 2–10 min K8s figure below are [E] typical industry estimates, illustrative only — not measured by us, not published by Modal. They convey the direction (Modal removes the node-provision and image-pull waits), not benchmarked values.
The "<1s" Modal start [D] excludes loading our model weights. A GPU pod scaling from zero on K8s is typically 2–10 min [E] end-to-end; Modal collapses that to seconds — but our service cold start is still gated by weight load + SGLang init (see §5).
3. Our workload facts
| Fact | Value |
|---|---|
| Image size (weights + SGLang + CUDA) | ~75 GB |
| Model | Closed-source, FP8 (no HF id; weights from our private store) |
| Runtime VRAM | ~40 GB (FP8 model + 16k KV cache) → fits on one GPU |
| Serving engine | SGLang |
| GPU | Any single FP8-capable GPU with enough VRAM (H100 / H200 / L40S …) — take whatever Modal has (D3) |
| GPU count | 1 (keeps us on the supported single-GPU snapshot path) |
4. Target architecture
A pure-CPU control-plane server (co-located with the gateway, inside the internal network with access to internal Redis and public S3) owns API standardization, job status, the MP4 upload, and the spillover decision. The GPU renderer — in-house or Modal — is interchangeable, stateless compute behind it.
4.1 The endpoint: one URL, load-balanced across replicas (verified)
A Modal web endpoint is one stable URL, not a per-machine address:
- URL is deterministic & stable: https://<workspace>--<app>-<function>.modal.run; unchanged across redeploys. Our gateway points at this single URL.
- Verified fan-out: 15 concurrent requests to one URL were served by 9 distinct containers spanning AWS + Azure + GCP (no region pin → full multi-cloud pool).
- Per-instance URLs? No. Individual containers are not addressable by default; there is no container-1.modal.run. The only way to reach a specific container is modal.forward() / Tunnels (an advanced feature for direct port exposure, e.g. distributed training), which we do not use here.
4.2 Where load-balancing behavior is defined
Modal owns the routing; you own the knobs — there is no custom routing/LB config to write.
| Concern | Who controls it | How |
|---|---|---|
| Request → container routing | Modal proxy (automatic) | Routes to any container with spare capacity. No rules to define. |
| How many requests per container before fanning out | You | @modal.concurrent(target_inputs=…, max_inputs=…) — target_inputs = autoscaler goal, max_inputs = hard ceiling (containers burst to this if scale-up lags) |
| How many containers exist | You | min_containers, buffer_containers, max_containers, scaledown_window |
| Session affinity / sticky routing | Not available in Modal | Routing is stateless. If we ever need affinity or custom LB, it goes in our gateway in front of Modal, or via separate named apps. |
| Region of placement | You (optional) | region= (adds 1.5–1.75× cost; default = unpinned multi-cloud pool) |
For our stateless OpenAI-style inference, the default stateless LB is exactly right: we only tune @modal.concurrent (throughput/$ per replica) and the autoscaler bounds.
4.3 Scale up/down (measured 2026-06-29)
Tested with min_containers=0, scaledown_window=20, @modal.concurrent(max_inputs=1),
driving 12 concurrent requests at the single URL:
| Phase | Containers |
|---|---|
| At rest | 0 (scaled to zero, $0) |
| Under load (12 concurrent) | 12 (one per request, max_inputs=1) |
| Idle, within scaledown_window | 12 (held ~20–24 s after last request) |
| Idle +32s → +40s | 12 → 1 → 0 |
Confirms the overflow economics: replicas exist only during the burst + a short cooldown,
then drain to zero. scaledown_window tunes how long warm replicas linger (latency vs cost).
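The container counts in the table follow from a simple packing rule, sketched below as a lower bound. This is a simplified model and an assumption on our part, not Modal's autoscaler logic: the real scheduler targets target_inputs and may run more containers than this minimum (the 15-request → 9-container fan-out in §4.1 was measured under different settings). The function name is hypothetical.

```python
import math


def min_containers_needed(concurrent_requests: int, max_inputs: int) -> int:
    """Fewest containers that can hold all in-flight requests at once.

    Simplified model: each container accepts at most max_inputs concurrent
    requests. The real autoscaler may run more than this lower bound.
    """
    if concurrent_requests < 0:
        raise ValueError("concurrent_requests must be >= 0")
    if max_inputs < 1:
        raise ValueError("max_inputs must be >= 1")
    return math.ceil(concurrent_requests / max_inputs)


# The measured run: 12 concurrent requests with max_inputs=1.
print(min_containers_needed(12, 1))  # 12
# Same burst with a hypothetical max_inputs=4 packs four requests per replica.
print(min_containers_needed(12, 4))  # 3
```

With max_inputs=1 the bound equals the request count, which matches the 12 containers observed under load.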
4.4 Governing principle: control plane vs render plane
Modal is a stateless render plane (compute only). The upper layer — a pure-CPU control-plane server co-located with the gateway — owns all state. Modal scales the renderer; status, lifecycle, Redis, and S3 stay above it.
| Concern | Owner |
|---|---|
| API standardization (POST /v1/videos, GET /{id}, GET /{id}/content) | CPU control-plane server |
| Job lifecycle, video_id, status/progress | CPU control-plane server → Redis (internal) |
| Render dispatch + spillover (in-house vs Modal) | CPU control-plane server |
| MP4 upload to S3 (+ presigned reads) | CPU control-plane server (public S3 access) |
| GPU rendering (frames → MP4 bytes) | Renderer (in-house / Modal) — pure compute, stateless |
Because the renderer holds no state, the hard distributed-systems questions disappear:
"which container has the mp4?" (none — the CPU control plane uploads it to S3 by video_id),
sticky sessions (N/A — fire-and-forget), keeping replicas warm / region pinning (nothing to keep).
The renderer touches nothing internal — it just returns the MP4 bytes to the CPU control
plane, which (already inside the internal network) owns Redis and S3. So there is
no HTTPS callback, no egress proxy, and no VPC on the Modal side.
5. Cold-start strategy (the core problem)
Anatomy of our cold start (what we're actually fighting)
container start (<1s [D]) + weight load (~43s [M-derived]) + SGLang init/compile (10s–min [E])
- The <1s [D] is Modal's floor (warm pool + lazy FS). We don't optimize this.
- The ~43 s [M-derived] weight load comes from our measured 0.92 GB/s applied to 40 GB (see below); the init time is [E], an estimate to confirm at bring-up. The snapshot captures the warm server, so a restored cold start skips both and moves back toward the few-seconds range.
Scale-to-zero means the cold start lands exactly when a burst hits. Three layers fix it:
Split the 75 GB image
- Image = SGLang + CUDA only (few GB).
- Weights → Modal Volume (hf-cache), pinned to an exact revision.
- Pre-compiled kernels (DeepGEMM) → a second Volume, built at image-build time.
Memory snapshot of the warm SGLang server (officially supported; 3–10× faster [D])
- Single GPU → enable_memory_snapshot=True + experimental_options={"enable_gpu_snapshot": True}.
- @modal.enter(snap=True): start SGLang, warm it, /release_memory_occupation, then snapshot.
- @modal.enter(snap=False): /resume_memory_occupation on restore.
- Env: TORCHINDUCTOR_COMPILE_THREADS=1 (else torch.compile breaks snapshots).
- Note: snapshot "warms up" over the first ~5 cold starts, not instantly on day one.
Burst headroom
- min_containers=0 (pay nothing idle) + buffer_containers=1 (first spike request doesn't eat a full cold start).
- Optional predictive pre-warm: bump min_containers on a schedule before known peaks.
Measured Volume load speed (benchmarked 2026-06-16)
- [M] Cold read off hf-cache: 0.92 GB/s (3.09 GB in 3.35 s, single CPU box, single file).
- [M-derived] Extrapolated to 40 GB: ~43 s worst-case storage read.
- This ~43 s is the one-time cost (snapshot build + first few boots); the snapshot skips it on every subsequent burst. The real serving box (modern GPU + sharded parallel load) should beat this, approaching Modal's advertised 1–2 GB/s [D].
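The extrapolation above is plain arithmetic, reproduced here so the figures can be rechecked when the benchmark is rerun. The function names are hypothetical; the inputs are the [M] and [D] numbers quoted in this section, and the model assumes a single sequential read with no parallelism.

```python
def read_throughput_gb_per_s(gigabytes: float, seconds: float) -> float:
    """Throughput of one benchmark read."""
    return gigabytes / seconds


def estimated_load_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to read size_gb at a constant throughput (no parallelism assumed)."""
    return size_gb / throughput_gb_per_s


# [M] benchmark: 3.09 GB read in 3.35 s.
print(round(read_throughput_gb_per_s(3.09, 3.35), 2))  # 0.92

# [M-derived] 40 GB of weights at the measured rate.
print(round(estimated_load_seconds(40, 0.92)))  # 43

# [D] Modal's advertised 1-2 GB/s range gives the bounds we hope to approach.
print(round(estimated_load_seconds(40, 1.0)))  # 40
print(round(estimated_load_seconds(40, 2.0)))  # 20
```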
6. Security & networking ("proxy")
Two distinct concerns:
Inbound: protect the endpoint (REQUIRED)
A public GPU endpoint that cold-starts on request is a cost/DoS risk.
- Use Proxy Auth Tokens: requires_proxy_auth=True on the web endpoint (D7).
- The CPU control-plane server is the only caller; it holds the Modal-Key + Modal-Secret and sends them on every render request.
- Modal rejects unauthorized requests at its edge, so no container spins up and junk traffic is free. Strictly better than app-level --api-key (which rejects only after boot).
- Keep the SGLang --api-key as defense-in-depth.
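A minimal sketch of how the control-plane server might assemble the headers for each render request. The environment variable names (MODAL_PROXY_KEY, MODAL_PROXY_SECRET, SGLANG_API_KEY) and the function name are hypothetical, and the Authorization: Bearer form for the SGLang key is an assumption to confirm against the SGLang version we deploy. Modal-Key and Modal-Secret are the proxy-auth header names used throughout this doc.

```python
from typing import Mapping


def render_request_headers(env: Mapping[str, str]) -> dict:
    """Build the auth headers the control plane sends on every render call."""
    key = env.get("MODAL_PROXY_KEY")        # hypothetical variable name
    secret = env.get("MODAL_PROXY_SECRET")  # hypothetical variable name
    if not key or not secret:
        # Fail closed: without the token pair the request is rejected at the edge anyway.
        raise RuntimeError("proxy-auth token missing; refusing to call the Modal endpoint")

    headers = {"Modal-Key": key, "Modal-Secret": secret}

    # Defense in depth: app-level key, assumed to be sent as a Bearer token.
    api_key = env.get("SGLANG_API_KEY")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


example_env = {
    "MODAL_PROXY_KEY": "wk-example",
    "MODAL_PROXY_SECRET": "ws-example",
    "SGLANG_API_KEY": "sk-example",
}
print(sorted(render_request_headers(example_env)))
# ['Authorization', 'Modal-Key', 'Modal-Secret']
```

In production the mapping would be os.environ or a secrets manager; passing it in as an argument keeps the function easy to test.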
Outbound: static egress IP (ONLY IF NEEDED)
- Needed only if weights/data live behind our firewall (private registry, internal S3, DB).
- modal.Proxy (WireGuard, static IP to allowlist). Team/Enterprise plan, ≤5 IPs.
- Skip if weights come from Hugging Face or IAM-keyed public S3.
7. Autoscaling knobs (Modal primitives, not K8s)
| Knob | Setting | Purpose |
|---|---|---|
| min_containers | 0 (or small for hot peaks) | Pay nothing idle |
| buffer_containers | 1–2 | Absorb first spike without cold start |
| max_containers | set a ceiling | Cost cap / blast-radius limit |
| @modal.concurrent(max_inputs=N) | tuned | Requests packed per replica (SGLang batches → $/token lever) |
| scaledown_window | e.g. 300 s | How long idle replicas linger before →0 |
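One way to keep these knobs reviewable is to hold them in a small validated config object per environment and splat it into the class decorator. The sketch below is an assumption about how we might organize this, not a Modal API: AutoscaleKnobs is a hypothetical name, the max_containers default is a placeholder, and the sanity checks are ours.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AutoscaleKnobs:
    """Autoscaler bounds for one environment, validated at construction."""

    min_containers: int = 0      # D6: pay nothing idle
    buffer_containers: int = 1   # D6: absorb the first spike
    max_containers: int = 8      # placeholder; the real ceiling comes from D10
    scaledown_window: int = 300  # seconds an idle replica lingers

    def __post_init__(self) -> None:
        if self.min_containers < 0 or self.buffer_containers < 0:
            raise ValueError("container counts must be non-negative")
        if self.max_containers < 1:
            raise ValueError("max_containers must be at least 1")
        if self.min_containers > self.max_containers:
            raise ValueError("min_containers cannot exceed max_containers")
        if self.scaledown_window <= 0:
            raise ValueError("scaledown_window must be positive")

    def as_kwargs(self) -> dict:
        """Keyword arguments matching the knob names in the table above."""
        return asdict(self)


knobs = AutoscaleKnobs()
print(knobs.as_kwargs())
```

The result could then be passed as @app.cls(..., **knobs.as_kwargs()), so a bad value fails at import time instead of at deploy time.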
8. Cost model
- Per-second GPU billing. Idle = $0 (scale-to-zero).
- Burst cost ≈ (GPU $/s) × (replicas) × (burst duration). max_containers caps it.
- Budget: $100k / month (D10). Enforce via Modal workspace spend limit + billing alerts; size max_containers so worst-case sustained burst stays within budget.
- Cheaper than a second always-on in-house cluster for spiky overflow; not for constant load.
- Region pinning premium [D] (Modal pricing docs, not measured by us): broad region 1.5×, narrow region 1.75× base cost. Default = no pin. What we did verify live [M]: region= works; region="us-east" landed in us-east-2 (AWS).
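The burst-cost formula and the budget-to-ceiling sizing can be sketched as follows. The GPU price of $4.00/hour is a made-up placeholder, not a Modal price, and the 720-hour month is a simplifying assumption; substitute the real per-second rate for whichever GPU type the pool lands on. Function names are hypothetical.

```python
import math

HOURS_PER_MONTH = 720  # 30-day month, a simplifying assumption


def burst_cost_usd(gpu_usd_per_hour: float, replicas: int, burst_seconds: float,
                   region_multiplier: float = 1.0) -> float:
    """Burst cost = (GPU $/s) x replicas x burst duration."""
    per_second = gpu_usd_per_hour * region_multiplier / 3600
    return per_second * replicas * burst_seconds


def max_containers_for_budget(monthly_budget_usd: float, gpu_usd_per_hour: float,
                              region_multiplier: float = 1.0) -> int:
    """Largest ceiling such that running flat-out all month stays in budget."""
    monthly_cost_per_replica = gpu_usd_per_hour * region_multiplier * HOURS_PER_MONTH
    return math.floor(monthly_budget_usd / monthly_cost_per_replica)


GPU_USD_PER_HOUR = 4.00  # hypothetical placeholder, NOT a quoted Modal price

# A 10-minute burst on 12 replicas.
print(round(burst_cost_usd(GPU_USD_PER_HOUR, replicas=12, burst_seconds=600), 2))  # 8.0

# Worst-case sustained ceiling under the $100k/month budget (D10).
print(max_containers_for_budget(100_000, GPU_USD_PER_HOUR))        # 34
print(max_containers_for_budget(100_000, GPU_USD_PER_HOUR, 1.75))  # 19 with a narrow-region pin
```

Sizing against a full month of sustained load is deliberately pessimistic for a burst tier; it bounds the blast radius rather than predicting the bill.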
9. Go-live checklist
Must-have
- Weights provisioning job → boson-weights Volume, pinned (closed-source, private store)
- Pre-compiled DeepGEMM kernels → second Volume
- SGLang serving modal.Cls: any single FP8-capable GPU, GPU snapshot, release/resume, respawn-on-fail
- requires_proxy_auth=True + proxy-auth tokens issued to the control-plane server
- Secrets: private weight-store creds (provisioning job), SGLang api-key, proxy-auth token
- max_containers sized to the $100k/mo budget + Modal spend limit/alerts
- Health/readiness probe (post snapshot-resume) for the control server
- Control-server spillover: saturation signal + hysteresis + drain-back; forward overflow to the Modal URL
Nice / depends
- dev + prod Modal environments; modal deploy in CI; pinned image + model revisions
- Observability: export logs/metrics; track spillover rate, cold-start p99, $/burst
- Region pinning (latency / data residency)
- Static-IP modal.Proxy (only if weights/data behind firewall)
10. Decisions (locked)
All decisions resolved with the team — values below are final.
| # | Decision | Value | Notes |
|---|---|---|---|
| D1 | Model source | Closed-source — no HF id | Private weights provisioned once into a Modal Volume from our own store (pinned). Store creds are a Secret used only by the provisioning job; the renderer reads weights from the Volume. |
| D2 | Quantization | FP8 (done) | Model already FP8. Constrains GPU choice to FP8-capable hardware (D3). |
| D3 | GPU | Any single FP8-capable GPU with enough VRAM — no fixed type | Take whatever Modal has via a preference list (e.g. gpu=["H100","H200","L40S"]), ordered by availability/cost; exclude non-FP8 parts (e.g. A100). Still 1 GPU (keeps single-GPU snapshot path). VRAM floor = FP8 model + 16k KV cache (≈ the ~40 GB runtime + headroom). Caveat: GPU snapshots are hardware-specific → expect one snapshot baked per GPU type in the pool (validate; narrow the list if it hurts). |
| D4 | Context length | 16k (--context-length 16384) | Size KV cache for 16k; included in the D3 VRAM floor. |
| D5 | Snapshot / failure | GPU snapshot ON, no CPU fallback | enable_memory_snapshot=True with experimental_options={"enable_gpu_snapshot": True} (§5). On snapshot/GPU failure: terminate and respawn a fresh container; no degraded CPU path. |
| D6 | Warm capacity | min_containers=0, buffer_containers=1 | Pure overflow = $0 idle; buffer absorbs the first spike. |
| D7 | Inbound auth | requires_proxy_auth=True; tokens held by the control-plane server | The Modal URL is public, but only the CPU control server calls it, so it carries the Modal-Key/Modal-Secret. Unauthorized traffic is rejected at Modal's edge, so no GPU spins up. |
| D8 | Outbound egress proxy | Not needed | Renderer needs no internal access; the control server reaches the renderer. |
| D9 | Load distribution | Use Modal's built-in autoscaling + LB | Individual containers aren't externally addressable (§4.1), so we send overflow to the single Modal endpoint and let Modal scale/route internally. The control server decides when to spill (in-house saturated) and forwards to the Modal URL. |
| D10 | Cost ceiling | $100k / month | Enforce via Modal workspace spend limit + billing alerts; size max_containers so worst-case sustained burst stays within budget. |
| D11 | Region | Default (no pin) | Avoids the 1.5–1.75× premium; maximum availability. |
| D12 | Environments | dev + prod, deploy via CI | Pinned image + weights per env. |
✅ All decisions locked — ready to implement.
11. References
- SGLang + Modal Snapshots example — https://modal.com/docs/examples/sglang_snapshot
- Memory Snapshots guide — https://modal.com/docs/guide/memory-snapshot
- GPU mem snapshots (blog) — https://modal.com/blog/gpu-mem-snapshots
- Proxy Auth Tokens — https://modal.com/docs/guide/webhook-proxy-auth
- Modal Proxies (static egress IP) — https://modal.com/docs/guide/proxy-ips
- Autoscaling — https://modal.com/docs/guide/scale
- High-performance LLM inference — https://modal.com/docs/guide/high-performance-llm-inference
12. Prototype artifacts (this workspace)
In ~/modal_examples/ (workspace bosonai):
- vllm_server.py: OpenAI-compatible server pattern (vLLM; SGLang version TBD)
- volume_speed.py: the Volume load-speed benchmark (0.92 GB/s result above)
- vllm_infer.py: offline batch inference reference
- Volume hf-cache: holds test weights
13. Implementation runbook (step by step)
Prereqs: modal CLI authed (workspace bosonai). Fill any remaining <TODO>s first (the §10 decisions are locked; the §14 skeleton still carries placeholders).
Step 0 — Secrets
modal secret create hf-token HF_TOKEN=<token> # only for a gated/private HF model; not needed under D1 (weights come from our private store)
modal secret create sglang-api-key SGLANG_API_KEY=<random> # app-level key (defense in depth)
# Proxy-auth tokens are created in the Modal dashboard (Settings → Proxy Auth Tokens),
# then handed to the gateway as Modal-Key / Modal-Secret.
Step 1 — Provision weights into the Volume (one-time job)
- A throwaway Modal Function that pulls the closed-source weights from our private store (creds via a Secret) into the boson-weights Volume, then volume.commit().
- Pin the exact version so Modal serves the same weights as in-house (D1).
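One way to make "pin the exact version" checkable is a checksum manifest written by the provisioning job and re-verified against the Volume contents. The sketch below uses only the standard library and runs against a temporary directory; the file names and helper names are hypothetical, and this is an illustration of the idea, not the provisioning job itself.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight shards never load fully into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return the sorted names of files that are missing, extra, or changed."""
    actual = build_manifest(root)
    names = set(manifest) | set(actual)
    return sorted(n for n in names if manifest.get(n) != actual.get(n))


# Demo against a throwaway directory standing in for the mounted Volume.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "model.safetensors").write_bytes(b"fake-weights-v1")
    (root / "config.json").write_text("{}")
    pinned = build_manifest(root)
    clean = verify_manifest(root, pinned)
    (root / "model.safetensors").write_bytes(b"fake-weights-v2")  # simulate drift
    drifted = verify_manifest(root, pinned)

print(clean)    # []
print(drifted)  # ['model.safetensors']
```

Comparing the same manifest against the in-house copy would confirm both tiers serve identical weights.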
Step 2 — Pre-compile kernels into a second Volume
- At image build time run sglang.compile_deep_gemm → write artifacts to a kernels Volume.
- Avoids JIT during snapshot creation.
Step 3 — Build the serving image (SGLang + CUDA only, NO weights — see §14 skeleton).
Step 4 — First deploy + warm the snapshot
modal deploy sglang_server.py
- Hit the endpoint ~5 times to let the snapshot "bake" (warm-up over first few cold starts).
Step 5 — Validate cold start
- Let it scale to zero (wait > scaledown_window), then send one request and measure TTFT.
- Target: a few seconds restored (vs tens of seconds to minutes unsnapshotted).
Step 6 — Wire the gateway router
- Add the Modal endpoint as an OpenAI backend, sending the proxy-auth headers (Modal-Key / Modal-Secret) that requires_proxy_auth expects.
- Implement spillover (D9): saturation signal + hysteresis + drain-back to in-house.
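The spillover rule (saturation signal + hysteresis + drain-back) can be sketched as a two-threshold controller. The thresholds (spill at 90 % in-house utilization, drain back at 70 %) and the class name are hypothetical placeholders; which saturation signal to use (queue depth, GPU utilization, in-flight requests) is still to be chosen.

```python
from dataclasses import dataclass


@dataclass
class SpilloverController:
    """Two-threshold (hysteresis) switch between in-house and Modal backends."""

    spill_above: float = 0.90  # hypothetical: start spilling at >= 90 % utilization
    drain_below: float = 0.70  # hypothetical: drain back at <= 70 % utilization
    spilling: bool = False

    def update(self, utilization: float) -> bool:
        """Feed the latest in-house saturation signal; return whether to spill."""
        if not self.spilling and utilization >= self.spill_above:
            self.spilling = True
        elif self.spilling and utilization <= self.drain_below:
            self.spilling = False
        # Between the two thresholds the previous state is kept (no flapping).
        return self.spilling

    def backend(self, utilization: float) -> str:
        return "modal" if self.update(utilization) else "in-house"


controller = SpilloverController()
trace = [controller.backend(u) for u in (0.50, 0.92, 0.80, 0.75, 0.65, 0.80)]
print(trace)
# ['in-house', 'modal', 'modal', 'modal', 'in-house', 'in-house']
```

The gap between the two thresholds is what prevents a burst hovering near saturation from toggling backends on every request, which would otherwise trigger repeated Modal cold starts.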
Step 7 — Guardrails & observability
- Set max_containers (D10). Export logs/metrics. Track spillover rate, cold-start p99, $/burst.
Step 8 — Promote dev → prod (separate Modal environments, deploy via CI).
14. Reference skeleton: sglang_server.py (to be filled)
import os
import subprocess

import modal

MODEL_PATH = "/weights/boson-avatar"  # D1: closed-source, provisioned into Volume (no HF id)
PORT = 30000                          # SGLang default

image = (
    modal.Image.from_registry("nvidia/cuda:12.x-devel-ubuntu22.04", add_python="3.12")
    .pip_install("sglang[all]==<TODO>", "huggingface_hub[hf_transfer]")
    .env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "TORCHINDUCTOR_COMPILE_THREADS": "1",  # required for snapshot + torch.compile
        "SGLANG_ENABLE_JIT_DEEPGEMM": "1",
    })
    # .run_function(precompile_deepgemm, volumes={...})  # Step 2
)

weights = modal.Volume.from_name("boson-weights", create_if_missing=True)  # D1 private weights
kernels = modal.Volume.from_name("sglang-kernels", create_if_missing=True)

app = modal.App("boson-sglang-overflow")


@app.cls(
    image=image,
    gpu=["H100", "H200", "L40S"],  # D3: any single FP8-capable GPU w/ enough VRAM
    volumes={"/weights": weights, "/kernels": kernels},
    secrets=[modal.Secret.from_name("sglang-api-key")],  # D1 weights live in the Volume, no model secret
    enable_memory_snapshot=True,  # D5 (GPU snapshot; no CPU fallback, respawn on fail)
    experimental_options={"enable_gpu_snapshot": True},
    min_containers=0, buffer_containers=1,  # D6
    max_containers=int(os.getenv("MAX_CONTAINERS", "...")),  # D10: sized to the $100k/mo budget
    scaledown_window=300,
    timeout=20 * 60,
)
@modal.concurrent(max_inputs=32)  # tune to KV headroom
class SGLangServer:
    @modal.enter(snap=True)
    def start_and_warm(self):
        # launch sglang server -> wait ready -> warm requests -> /release_memory_occupation
        cmd = [
            "python", "-m", "sglang.launch_server",
            "--model-path", MODEL_PATH,
            "--host", "0.0.0.0", "--port", str(PORT),
            "--api-key", os.environ["SGLANG_API_KEY"],
            "--quantization", "fp8",      # D2
            "--context-length", "16384",  # D4: 16k
        ]
        self.proc = subprocess.Popen(cmd)
        # _wait_until_healthy(); _warmup(); _post("/release_memory_occupation")

    @modal.enter(snap=False)
    def resume(self):
        ...  # _post("/resume_memory_occupation")

    @modal.web_server(port=PORT, requires_proxy_auth=True, startup_timeout=10 * 60)  # D7
    def serve(self):
        ...  # SGLang already serves the OpenAI-compatible API on PORT
This is a structural skeleton, not a tested file. The exact SGLang flags, the warm-up / release / resume calls, and the web_server vs asgi_app wiring get finalized against the real model + SGLang version, using the locked D1/D2/D4 values (§10).
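The skeleton leaves _wait_until_healthy() as a comment. A possible shape for that helper is sketched below with the probe injected, so the polling logic can be tested without a running server. Both function names are hypothetical, and the /health path in http_probe is an assumption to confirm against the SGLang version we pin.

```python
import time
import urllib.request
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool], timeout_s: float = 600.0,
                       interval_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused while the server is still booting
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_probe(port: int, path: str = "/health") -> Callable[[], bool]:
    """Build a probe that GETs a local health URL (path is an assumption)."""
    def probe() -> bool:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}", timeout=2) as resp:
            return resp.status == 200
    return probe


# Demo with a fake probe that becomes healthy on the third poll.
calls = {"n": 0}

def fake_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until_healthy(fake_probe, timeout_s=5, sleep=lambda s: None))  # True
print(calls["n"])  # 3
```

Inside start_and_warm this would be called as wait_until_healthy(http_probe(PORT)) before the warm-up requests, and the same probe could back the post-resume readiness check in §9.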
15. Async render flow (maps to Boson video API)
A pure-CPU control-plane server sits between the gateway and the GPU renderers. Deployed in the gateway's internal network (so it can reach internal Redis and public S3), it is the control plane: it standardizes the Boson video API, tracks job status in Redis, dispatches renders, and uploads the finished MP4 to S3. The GPU renderer — in-house or Modal — is pure compute: it receives a render request and returns the MP4 bytes. It never touches Redis or S3.
API (unchanged): POST /v1/videos → {id, status:"queued", progress, ...}; GET /v1/videos/{id} (status); GET /v1/videos/{id}/content (MP4). Status ∈ queued|in_progress|completed|failed.
| Tier | Network | Responsibilities |
|---|---|---|
| Gateway | edge | ingress, auth, route to control plane |
| CPU control-plane server | internal (Redis + public S3) | API standardization, job status in Redis, render dispatch + spillover, MP4 upload to S3 |
| GPU renderer (in-house / Modal) | compute only | render frames → return MP4 bytes; no Redis, no S3, no state |
Why this is clean — the renderer touches nothing internal. Redis and S3 are owned entirely by the CPU control plane, which already lives in the internal network. The Modal renderer only needs its inbound model weights (read from the Modal Volume, provisioned once from our private store) and returns MP4 bytes over its proxy-auth'd endpoint. So:
- No HTTPS callback, no modal.Proxy egress IP, no VPC: the renderer never reaches Redis or S3.
- Fully stateless: no container affinity, no "which container has the MP4", no sticky sessions (§4.4).
Status reporting — decision: the CPU control plane writes Redis directly (it's already inside the internal network). The render call is just request→MP4-bytes. Superseded: the earlier "HTTPS callback from the renderer" idea is no longer needed, because the renderer no longer owns S3 or status.
Progress granularity: coarse status (queued → in_progress → completed) comes free from the
dispatch lifecycle. For live progress 0–100, have the renderer stream progress events back to
the control plane during the render; otherwise the control plane reports coarse status.
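The job lifecycle on the control plane can be sketched as a small state machine. Here an in-memory dict stands in for Redis and a plain callable stands in for the S3 upload; JobStore, run_job, and the render/upload signatures are all hypothetical, and the renderer is modeled as a function that returns MP4 bytes and optionally reports progress.

```python
# Legal status transitions for the statuses listed in the API above.
TRANSITIONS = {
    "queued": {"in_progress", "failed"},
    "in_progress": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


class JobStore:
    """In-memory stand-in for the Redis status writer."""

    def __init__(self):
        self.jobs = {}

    def create(self, video_id):
        self.jobs[video_id] = {"status": "queued", "progress": 0}

    def set_status(self, video_id, status):
        current = self.jobs[video_id]["status"]
        if status not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {status}")
        self.jobs[video_id]["status"] = status

    def set_progress(self, video_id, percent):
        job = self.jobs[video_id]
        # Clamp to 0-100 and never move backwards.
        job["progress"] = max(job["progress"], min(100, int(percent)))


def run_job(store, video_id, render, upload):
    """Dispatch one render; the renderer only ever returns bytes."""
    store.set_status(video_id, "in_progress")
    try:
        mp4 = render(lambda pct: store.set_progress(video_id, pct))
        upload(video_id, mp4)  # control plane owns the S3 write
    except Exception:
        store.set_status(video_id, "failed")
        return
    store.set_progress(video_id, 100)
    store.set_status(video_id, "completed")


def fake_render(on_progress):
    on_progress(50)  # optional live progress event
    return b"mp4-bytes"


uploads = {}
store = JobStore()
store.create("vid_1")
run_job(store, "vid_1", fake_render, uploads.__setitem__)
print(store.jobs["vid_1"])  # {'status': 'completed', 'progress': 100}
```

A renderer that never calls on_progress still yields the coarse queued → in_progress → completed sequence, which matches the fallback described above.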
Work items:
- CPU control-plane server: Boson API standardization + Redis status writer + S3 uploader
- Render dispatch with spillover (in-house default, Modal on saturation)
- Modal render endpoint returns MP4 bytes (proxy-auth protected); optional progress stream