Modal Burst-Overflow Scalability Plan

Owner: erik@boson.ai
Last updated: 2026-07-06
Status: LIVE. The official higgs-avatar v1.0.0 release serves on Modal behind the stable endpoint https://bosonai--avatar.modal.run (hardened: warmup-gated routing, watchdog, liveness probe). Full streaming e2e verified repeatedly; all milestones below executed.

Number provenance ([M]/[D]/[T]/[E] tags): see the Appendix at the end of this doc.

1. Goal#

Progress tracker#

#	Milestone	Status
1	Design resolved — renderer = async video job via `spawn()`	✅
2	Modal mechanics validated — autoscale, LB, region	✅
3	Control-plane auth — token `spawn()`	✅
4	Volume read probe	✅
5	Weight-provisioning job	✅
6	Serving + snapshot lifecycle — real image serves; snapshot measured *	✅
7	Cold-start / concurrency — measured on the real image *	✅
8	Spillover + back-pressure (§16)	✅
9	Control-plane server (API/Redis/local MP4)	✅
10	Monitoring + alerts (§17)	✅
11	Prod sign-off	✅
12	Gateway overflow sidecar (dynamic weight) (§16.1)	✅
13	Real-image streaming e2e — official v1.0.0 avatar endpoint live	✅
14	Production hardening + re-validation — gated routing · watchdog · probe	✅
15	Elastic validation — multi-replica scale-out/in, affinity, GPU matrix	✅

✅ done · 🔶 in progress · ⬜ todo. Each milestone links to its build + validation + code-review report. Items 5–11 were each implemented, run, and independently reviewed by a separate code-review subagent (findings fixed and re-validated).

* 6 & 7 — verified by execution on the real image (details: findings · verification): real renders produced (offline ~44 s/clip, streaming init.fmp4 ≈ 3–9 s, full 3 s-audio clip ≈ 9–12 s e2e); memory-snapshot restore measured 56.8 s vs 130 s boot (alpha: one re-bake took 1213 s — hence the warm-pool posture); cold start: dev fast-boot 131 s to API-up [measured]; v1.0.0 boot→session-ready observed 9–17 min (bucket-count dependent; single-bucket warmup deployed); warm **0.4 s** [measured]. v1.0.0's profile is single-session per replica — concurrency scales by replica count, not per-GPU batching.

Number provenance (re-verified 2026-06-30). Every result number was produced by execution:

Milestones 5–7 + the real image → Modal CLI/SDK (modal run / modal deploy, run IDs in the workspace dashboard). E.g. M5 = 51 files / 1,969,843,995 B; real-image cold-start 95.3 s on H100; image 57.4 GB (Docker Hub API); FunctionStats fields confirmed via the SDK.
Milestones 8–12 are control-plane logic (not Modal) → verified by local python3 mN.py runs.
All threshold / weight / hysteresis values (0.85/0.60, M10's 95%/5%/30 s, poll ~3 s, the max_inputs/max_containers examples) are illustrative starting defaults — chosen, not measured — to be tuned with real traffic. No number here is asserted as a Modal measurement unless it came from an executed Modal run.

Use Modal as a burst-overflow rendering tier, not the primary serving layer.

In-house GPU cluster handles stable/baseline traffic.
Modal absorbs spikes only, then scales to zero so we pay nothing when idle.
The Modal renderer is an async video job — it takes a render request and returns the finished MP4. In-house and Modal renderers are interchangeable behind the control plane. (SGLang is the renderer's internal token backbone; the external interface is the video job.)

Non-goals: running 24/7 max-throughput on Modal (in-house is cheaper for constant load); training.

Modal is not Kubernetes. It is a custom serverless GPU platform:

Layer	Implementation
Orchestration	Custom multi-cloud scheduler in Rust (deliberately not Docker/K8s)
Container runtime	Custom Rust runtime, sub-second container start [D]
Isolation	gVisor (multi-tenant sandbox; runs gVisor with GPUs, per Modal)
Filesystem	ImageFS — lazy-loading, content-addressed, FUSE. Images stream on demand
Hardware	Bare-metal GPU instances pooled across clouds (e.g. Oracle OCI bare metal)

Implications for us:

No nodes / pods / cluster / HPA to manage. Scaling = Modal primitives, not K8s objects.
Lazy content-addressed FS → a 75 GB image does not fully download on boot; only files actually read are fetched + cached. Code cold-start is fast; weight loading is the real cost.
gVisor adds a thin syscall-compat layer; its overhead is expected to be negligible for inference (not yet measured).

Why container start is "<1s" (and what it excludes). Modal removes the three things that make a "normal" cold start slow:

Normal slow step	Typical time [E]	Modal's trick
Provision a VM / add a K8s node	30 s – 5 min	Warm bare-metal pool, GPUs pre-attached → ~0
`docker pull` (extract all layers first)	30 s – minutes (multi-GB GPU images)	ImageFS lazy mount: start now, stream files on demand
Container runtime setup	0.1 – 1 s	Rust runtime + gVisor, optimized hot path

The time ranges above and the 2–10 min K8s figure below are [E] typical industry estimates, illustrative only — not measured by us, not published by Modal. They convey the direction (Modal removes node-provision + image-pull waits), not benchmarked values.

The "<1s" Modal start [D] excludes loading our model weights. A GPU pod scaling from zero on K8s is typically 2–10 min [E] end-to-end; Modal collapses that to seconds — but our service cold start is still gated by weight load + engine init (see §5).

3. Our workload facts#

Fact	Value	Source
Image size (weights + SGLang + CUDA)	~75 GB	[T]
Model	Closed-source, FP8 (no HF id; weights from our private store)	[T]
Runtime VRAM	~40 GB (incl. 16k KV cache)	[T]
Engine	SGLang (internal token backbone)	[T]
External interface	Async video job → returns MP4 (invoked via `spawn()`)	resolved (C1/C2)
GPU	One GPU with validated VRAM headroom. For the snapshot path, pin a single GPU type (snapshots are hardware-specific — H1). Likely H100/H200; L40S (48 GB) is marginal for 40 GB + overhead — validate before listing (H2). FP8 requires Hopper/Ada/Blackwell (excludes A100).	revised post-review

Not yet measured: real VRAM-at-runtime per candidate GPU, and whether 40 GB + KV + snapshot overhead actually fits 48 GB. Validate before committing to a GPU.

4. Target architecture#

Figure (§4, target architecture): a pure-CPU control-plane server (co-located with the gateway, in the internal network) owns the API, Redis status, the local MP4 store, and render dispatch. The GPU renderer is interchangeable, stateless compute behind it.

A pure-CPU control-plane server (co-located with the gateway, inside the internal network with access to internal Redis; MP4 on local disk by default, S3 optional) owns API standardization, job status, the MP4 upload, and the spillover decision (§16). The GPU renderer — in-house or Modal — is stateless compute behind it.

What was tested: a synthetic debian_slim CPU container (FastAPI + time.sleep), not our renderer. These results characterise Modal's scheduler/LB, and say nothing about our image, GPU, cold start, or real concurrency.

A Modal web endpoint is one stable URL, not a per-machine address:

Figure (§4.1, Modal LB mechanics), [M-platform] (synthetic CPU container, NOT our renderer): 15 requests → 9 containers across AWS + Azure + GCP. Demonstrates Modal's scheduler, not our per-GPU concurrency.

URL is deterministic & stable: https://<workspace>--<app>-<function>.modal.run.
[M-platform] Fan-out: 15 concurrent requests to one URL were served by 9 distinct containers across AWS + Azure + GCP (no region pin). Demonstrates Modal's LB/multi-cloud pool — not our renderer's per-GPU concurrency.
Per-instance URLs? No. Containers are not individually addressable; the only escape hatch is modal.forward() / Tunnels (not used here).
Note (C2): the render path uses function invocation (spawn()), not a public web URL (§15). Modal's autoscaling + LB apply to function calls too, so this section's mechanics still hold, but the "one public URL" framing is specific to the web-endpoint case.

4.2 Where load-balancing behavior is defined#

Modal owns the routing; you own the knobs — there is no custom routing/LB config to write.

Concern	Who controls it	How
Request → container routing	Modal (automatic)	Routes to any container with spare capacity.
Requests packed per container before fan-out	You	`@modal.concurrent(target_inputs=…, max_inputs=…)` — `target_inputs` = autoscaler goal, `max_inputs` = hard ceiling
How many containers exist	You	`min_containers`, `buffer_containers`, `max_containers`, `scaledown_window`
Session affinity / sticky routing	Not available	Routing is stateless. Affinity, if ever needed, lives in our control plane.
Region of placement	You (optional)	`region=` (multiplies cost by 1.5–1.75× [D]; default = unpinned)

4.3 Scale up/down — [M-platform] synthetic autoscaler test#

Synthetic test, not our renderer. Container = debian_slim + time.sleep(3); min_containers=0, scaledown_window=20, @modal.concurrent(max_inputs=1); driven by 12 concurrent test requests.

Phase	Containers
At rest	0 (scaled to zero, $0)
Under load (12 test requests)	12
Idle, within `scaledown_window`	12 (held ~20–24 s)
Idle +32 s → +40 s	12 → 1 → 0

What this proves: Modal's scale-to-zero, fan-out, and drain mechanics work. What the "12" is NOT: it is 12 test requests × max_inputs=1 (one request pins one container), i.e. the test config, not our renderer's concurrency. With a larger max_inputs (for example the illustrative max_inputs=32), the same 12 requests would fit in about 1 container. This measures nothing about our image, cold start, or real per-GPU concurrency; those need the real image deployed.
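The relation between request count, max_inputs, and container count in this test is a ceiling division. A minimal sketch, assuming every container is packed up to max_inputs before Modal starts another one (the synthetic test only exercised max_inputs=1, so the packing behaviour is an assumption; the function name is hypothetical):

```python
import math


def containers_needed(concurrent_requests: int, max_inputs: int) -> int:
    """Lower bound on containers if each one is filled to max_inputs."""
    if max_inputs < 1:
        raise ValueError("max_inputs must be >= 1")
    return math.ceil(concurrent_requests / max_inputs)


# The synthetic test: one request pins one container.
test_config = containers_needed(12, 1)
# The same load with an illustrative max_inputs=32.
packed_config = containers_needed(12, 32)
print(test_config, packed_config)
```

The real figure depends on the renderer's measured per-GPU concurrency, which this arithmetic does not provide.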

4.4 Governing principle — control plane vs render plane#

Modal is a stateless render plane (compute only). The pure-CPU control-plane server owns all state — status, lifecycle, Redis, and the MP4 store (local disk by default; S3 optional).

Concern	Owner
API standardization (`POST /v1/videos`, `GET /{id}`, `GET /{id}/content`)	CPU control-plane server
Job lifecycle, `video_id`, status/progress	CPU control-plane server → Redis (internal)
Render dispatch + spillover (in-house vs Modal)	CPU control-plane server (§16)
MP4 store — control-plane local disk by default (served + TTL-cleaned); S3 optional	CPU control-plane server
GPU rendering (frames → MP4)	Renderer (in-house / Modal) — stateless compute

Because the renderer holds no state, the hard distributed-systems questions disappear: "which container has the mp4?" (none — the control plane stores it on local disk by video_id), sticky sessions (N/A), keeping replicas warm / region pinning (nothing to keep).

Caveat (C2/[D]): the renderer returns the MP4 as the Modal function result. Modal stages large return values through its own object store transparently — so the bytes do transit Modal storage, even though the renderer never touches our Redis or storage. The "renderer touches nothing internal" claim is true of our network only.

5. Cold-start strategy (the core problem)#

Anatomy of our cold start#

Figure (§5, anatomy of a cold start): [D] Modal docs · [E] estimate (not measured on the real model). The snapshot is meant to skip weight-load + init on subsequent bursts; the realised cold start is unmeasured until the real image runs.

The <1s [D] is Modal's floor (warm pool + lazy FS).
~43s [E] weight load — extrapolated, not measured on the real model (see probe below).
engine init [E] — an estimate; this is the largest unknown. The snapshot is meant to skip weight-load + init on subsequent bursts, but the realised cold start is unmeasured until the real image runs.

Scale-to-zero means the cold start lands exactly when a burst hits. Three layers attack it:

Split the 75 GB image — image = SGLang + CUDA only; weights → Modal Volume (pinned); pre-compiled DeepGEMM kernels → a second Volume.
Memory snapshot of the warm server — documented to give 3–10× [D] faster cold starts.
- GPU snapshot is an alpha feature [D] and hardware-specific — see H1/D5.
- @modal.enter(snap=True): start engine, warm it, /release_memory_occupation, snapshot.
- @modal.enter(snap=False): /resume_memory_occupation on restore.
- TORCHINDUCTOR_COMPILE_THREADS=1 (else torch.compile breaks snapshots [D]).
- Warms up over the first ~5 [D] cold starts — per GPU type (H1).
Burst headroom — min_containers=0 + buffer_containers=1 (covers only one concurrent first request, not a burst — see §16); optional predictive pre-warm before known peaks.

Volume read probe — [M-platform], generic, NOT the real model#

[M-platform] Cold read off a Volume: 0.92 GB/s (3.09 GB Qwen safetensors in 3.35 s, single CPU box, single file). This is a generic Volume-throughput probe, not our model on a GPU.
[E] Extrapolated to 40 GB → ~43 s — a rough upper bound, not a measurement.
The snapshot is meant to make this a one-time cost; a real GPU box with sharded parallel load should be faster, toward Modal's advertised 1–2 GB/s [D]. Confirm by measuring the real model.
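The extrapolation above is plain division, reproduced here so the inputs can be swapped once the real model is measured. The function names are hypothetical; the numbers are the probe values quoted in this section plus Modal's advertised 1–2 GB/s range:

```python
def read_throughput_gb_s(size_gb: float, seconds: float) -> float:
    """Observed throughput of a single cold read."""
    return size_gb / seconds


def estimated_load_s(weights_gb: float, throughput_gb_s: float) -> float:
    """Time to read the weights at a given sustained throughput."""
    return weights_gb / throughput_gb_s


probe_gb_s = read_throughput_gb_s(3.09, 3.35)        # the Volume probe
at_probe_rate = estimated_load_s(40, 0.92)           # the ~43 s estimate
at_advertised = [estimated_load_s(40, r) for r in (1.0, 2.0)]

print(round(probe_gb_s, 2), round(at_probe_rate), at_advertised)
```

This only models the read; engine init is a separate term and is not estimated here.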

6. Security & networking ("proxy")#

Inbound auth — function invocation (resolved, validated)#

The render path uses function spawn(), so there is no public endpoint. The control plane authenticates to Modal with a Modal API token (a Secret). No public attack surface, no proxy-auth, no edge-DoS concern.
[M-platform] Validated (2026-06-30): with token-only auth (MODAL_TOKEN_ID/MODAL_TOKEN_SECRET, isolated HOME / no profile file), a stand-in control plane spawn()→get() round-tripped a job — including a 4 MiB result via Modal's blob store — and a wrong token was rejected (AuthError: Token validation failed). Confirms the control-plane→renderer auth path. (Mechanism tested with a stand-in function, not the real renderer.)
For production: mint a dedicated service token via modal token new (don't embed a personal token); store as MODAL_TOKEN_ID/MODAL_TOKEN_SECRET in the control plane's secret manager. Reference patterns: ~/modal_examples/token_render.py (spawn-able renderer), token_client.py (control-plane caller).
(If a web endpoint were ever added it would need requires_proxy_auth=True [D] and would hit the 150s request cap — not used here.)

Outbound — static egress IP (not needed)#

The renderer needs no access to our internal network. modal.Proxy (static IP, Team/Enterprise, ≤5 IPs [D]) is only relevant if that ever changes. Skip.

Knob	Setting	Purpose
`min_containers`	`0` (or small for hot peaks)	Pay nothing idle
`buffer_containers`	`1–2`	Absorb the first spike request(s) — limited (§16)
`max_containers`	set a ceiling	Cost cap / blast-radius limit (D10)
`@modal.concurrent(target_inputs, max_inputs)`	tuned to real per-GPU concurrency (measure)	Requests packed per replica ($/render lever)
`scaledown_window`	e.g. 300 s	How long idle replicas linger before →0

8. Cost model#

Per-second GPU billing. Idle = $0 (scale-to-zero).
Burst cost ≈ (GPU $/s) × (replicas) × (burst duration). max_containers caps it.
Budget: $100k / month [T] — enforce via Modal workspace spend limit + billing alerts; size max_containers so worst-case sustained burst stays within budget. (The actual max_containers ↔ $100k arithmetic is not yet computed — do it once the GPU + $/render are known.)
Cheaper than a second always-on in-house cluster for spiky overflow; not for constant load.
Region pinning premium [D] (Modal pricing docs, not measured): broad 1.5×, narrow 1.75× base. Default = no pin. [M-platform] region= works (us-east → us-east-2, AWS) — note this test itself incurred the 1.5× broad-region premium.
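The burst-cost formula and the max_containers sizing can be sketched as below. The GPU price used in the example is a placeholder, not a Modal price, and the function names are hypothetical; substitute the real $/s for the chosen GPU before using the result.

```python
def burst_cost_usd(gpu_usd_per_s: float, replicas: int, duration_s: float,
                   region_multiplier: float = 1.0) -> float:
    """Burst cost = (GPU $/s) x replicas x duration, times any region premium."""
    return gpu_usd_per_s * replicas * duration_s * region_multiplier


def max_containers_for_budget(budget_usd: float, gpu_usd_per_s: float,
                              sustained_s: float,
                              region_multiplier: float = 1.0) -> int:
    """Largest replica count whose worst-case sustained burst fits the budget."""
    per_container = gpu_usd_per_s * sustained_s * region_multiplier
    return int(budget_usd // per_container)


PLACEHOLDER_USD_PER_S = 0.001          # NOT a real price; replace with the actual rate
MONTH_S = 30 * 24 * 3600               # worst case: the burst lasts all month

one_hour_ten_replicas = burst_cost_usd(PLACEHOLDER_USD_PER_S, 10, 3600)
cap_unpinned = max_containers_for_budget(100_000, PLACEHOLDER_USD_PER_S, MONTH_S)
cap_broad_pin = max_containers_for_budget(100_000, PLACEHOLDER_USD_PER_S, MONTH_S, 1.5)
print(one_hour_ten_replicas, cap_unpinned, cap_broad_pin)
```

Sizing against a full month of sustained burst is deliberately pessimistic; a shorter assumed burst window gives a higher cap.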

9. Go-live checklist#

Remaining before sign-off (C1/C2 resolved)

Measure the real image: cold start, per-GPU concurrency, GPU-OOM headroom, snapshot speedup
Finalize SGLang flags / max_inputs against the real model

Must-have

Weights provisioning job → boson-weights Volume, pinned (closed-source, private store)
Pre-compiled DeepGEMM kernels → second Volume
Serving modal.Cls: single validated GPU type, GPU snapshot (alpha), release/resume, respawn-on-fail
Auth per C2: Modal API token on control plane (function path) or proxy-auth (web path)
Secrets: private weight-store creds (provisioning job), Modal token, app api-key
max_containers sized to the $100k/mo budget + Modal spend limit/alerts
Control-plane spillover + back-pressure (§16)
Monitoring: GPU-OOM alert + control-plane SLIs (§17)

Nice / depends

dev + prod Modal environments; deploy via CI; pinned image + weights per env
Region pinning (only if latency / data residency demands — pay the premium)

10. Decisions#

Set with the team; D3/D5/D7 revised after the design review; C1/C2 resolved.

#	Decision	Value	Notes
D1	Model source	Closed-source — no HF id	Private weights provisioned once into a Modal Volume from our store (pinned). Store creds = a Secret used only by the provisioning job.
D2	Quantization	FP8 (done)	Constrains GPU to FP8-capable hardware (Hopper/Ada/Blackwell; excludes A100).
D3	GPU	One GPU, single validated type (revised)	GPU snapshots are hardware-specific (H1), so a multi-type list fragments them (one snapshot per type, rarely-used types never warm). Pin one type with validated VRAM headroom (likely H100/H200; L40S marginal — validate, H2). Not the earlier open `["H100","H200","L40S"]` list.
D4	Context length	16k	The token backbone's sequence length; sized into the VRAM floor.
D5	Snapshot / failure	GPU snapshot ON (alpha), no CPU fallback (revised)	`enable_gpu_snapshot=True` is alpha [D]. On snapshot/GPU failure → respawn a fresh container. Accept that the tail (cold, un-snapshotted) pays full cold start.
D6	Warm capacity	`min_containers=0`, `buffer_containers=1`	$0 idle; buffer covers only the first request(s) — burst handling in §16.
D7	Auth	Modal API token on control plane — validated [M-platform]	Render via `spawn()` → no public endpoint; control plane authenticates with an API token (Secret). Token-only `spawn()`→`get()` round-trip confirmed; bad token rejected (§6). Mint a dedicated service token (`modal token new`) for production.
D8	Outbound egress proxy	Not needed	Renderer needs no internal access.
D9	Load distribution	Use Modal's built-in autoscaling + LB	Containers aren't externally addressable; the control plane decides when to spill (§16) and Modal distributes internally.
D10	Cost ceiling	$100k / month	Modal spend limit + alerts; size `max_containers` (arithmetic TBD once $/render known).
D11	Region	Default (no pin)	Avoids the 1.5–1.75× premium.
D12	Environments	`dev` + `prod`, deploy via CI	Pinned image + weights per env.

Remaining gate: real-image validation (cold start, per-GPU concurrency, OOM, $/render) before final sign-off. C1/C2 resolved.

11. References#

SGLang + Modal Snapshots — https://modal.com/docs/examples/sglang_snapshot
Memory Snapshots (incl. GPU snapshot alpha) — https://modal.com/docs/guide/memory-snapshot
Web endpoint timeouts (150s) — https://modal.com/docs/guide/webhook-timeouts
Proxy Auth Tokens — https://modal.com/docs/guide/webhook-proxy-auth
Modal Proxies (static egress IP) — https://modal.com/docs/guide/proxy-ips
Region selection & pricing — https://modal.com/docs/guide/region-selection
GPU metrics / health — https://modal.com/docs/guide/gpu-metrics · https://modal.com/blog/gpu-health
Autoscaling — https://modal.com/docs/guide/scale

12. Prototype artifacts (this workspace)#

In ~/modal_examples/ (workspace bosonai) — all stand-in tests, not the real renderer:

scale_endpoint.py / lb_endpoint.py — synthetic autoscaler + fan-out tests (§4.1/§4.3)
volume_speed.py — generic Volume read probe (0.92 GB/s, §5)
region_endpoint.py — region pin test (§8)
token_render.py / token_client.py — control-plane auth test: token-only spawn()→get() (D7/§6)
vllm_*.py — generic vLLM serving/inference references

13. Implementation runbook (step by step)#

Prereqs: modal CLI authed (workspace bosonai); C1 + C2 resolved.

Step 0 — Secrets: private weight-store creds; Modal API token for the control plane (function path) or proxy-auth tokens (web path); app api-key.

Step 1 — Provision weights → boson-weights Volume from our private store (creds via Secret), pinned to the exact version; volume.commit().

Step 2 — Pre-compile kernels (sglang.compile_deep_gemm) → second Volume (avoids JIT during snapshot).

Step 3 — Build the serving image (engine + CUDA only, NO weights).

Step 4 — First deploy + bake the snapshot — invoke ~5 times per GPU type to bake snapshots.

Step 5 — Measure cold start (real image) — scale to zero, invoke, measure restored cold start. Target a few seconds — unverified until measured.

Step 6 — Wire the control plane — render dispatch via spawn(); spillover + back-pressure (§16).

Step 7 — Guardrails & observability — max_containers; GPU-OOM alert; SLIs (§17).

Step 8 — Promote dev → prod.

14. Reference skeleton — `renderer.py` (illustrative, NOT tested)#

⚠️ Reflects the resolved design: renderer = async video job returning MP4, invoked via spawn() (function timeout, not the 150s web limit), single GPU type, GPU snapshot (alpha). SGLang is the internal token backbone; exact flags/values finalize against the real model.

import os, subprocess
import modal

MODEL_PATH = "/weights/boson-avatar"     # D1 — closed-source, provisioned into Volume
PORT = 30000

image = (
    modal.Image.from_registry("nvidia/cuda:12.x-devel-ubuntu22.04", add_python="3.12")
    .pip_install("sglang[all]==<pin>")
    .env({"TORCHINDUCTOR_COMPILE_THREADS": "1", "SGLANG_ENABLE_JIT_DEEPGEMM": "1"})
    # .run_function(precompile_deepgemm, volumes={...})   # Step 2
)
weights = modal.Volume.from_name("boson-weights", create_if_missing=True)
kernels = modal.Volume.from_name("sglang-kernels", create_if_missing=True)
app = modal.App("boson-renderer-overflow")

@app.cls(
    image=image,
    gpu="H100",                                       # D3 — ONE validated GPU type (not a list)
    volumes={"/weights": weights, "/kernels": kernels},
    enable_memory_snapshot=True,                       # D5 — alpha; no CPU fallback, respawn on fail
    experimental_options={"enable_gpu_snapshot": True},
    min_containers=0, buffer_containers=1,             # D6
    max_containers=int(os.environ["MAX_CONTAINERS"]),  # D10: size to $100k/mo; required, no default
    scaledown_window=300,
    timeout=20 * 60,                                   # function timeout (NOT the 150s web limit)
)
@modal.concurrent(max_inputs=8)                        # MEASURE real per-GPU concurrency first
class Renderer:
    @modal.enter(snap=True)
    def start_and_warm(self):
        cmd = ["python", "-m", "sglang.launch_server", "--model-path", MODEL_PATH,
               "--host", "0.0.0.0", "--port", str(PORT),
               "--quantization", "fp8", "--context-length", "16384"]   # D2/D4
        self.proc = subprocess.Popen(cmd)
        # _wait_until_healthy(); _warmup(); _post("/release_memory_occupation")

    @modal.enter(snap=False)
    def resume(self):
        ...  # _post("/resume_memory_occupation")

    @modal.method()                                    # C2 — called via .spawn(); NOT a web endpoint
    def render(self, req) -> bytes:
        import os, tempfile
        out = os.path.join(tempfile.gettempdir(), f"{req['video_id']}.mp4")
        try:
            self._run_inference(req, out)              # render -> MP4 cached on RENDERER local disk
            with open(out, "rb") as f:
                return f.read()                        # transmit to control plane (function result)
        finally:
            if os.path.exists(out):
                os.remove(out)                         # renderer cleans its local MP4 after handoff

Control plane (their CPU server, using the Modal SDK):

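import modal
Renderer = modal.Cls.from_name("boson-renderer-overflow", "Renderer")  # look up the deployed class by app + class name
call = Renderer().render.spawn(req)        # non-blocking; returns a FunctionCall
# ... poll call (FunctionCall.from_id) or await result; then write bytes to LOCAL DISK + update Redis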

15. Async render flow (maps to Boson video API)#

A pure-CPU control-plane server sits between the gateway and the GPU renderers, inside the internal network where it can reach internal Redis. It standardizes the Boson video API, tracks status in Redis, dispatches renders, and stores each MP4 on local disk (served from there and TTL-cleaned regularly). S3 is optional; see the "Storage" note below.

API (unchanged): POST /v1/videos → {id, status:"queued", progress, ...}; GET /v1/videos/{id}; GET /v1/videos/{id}/content. Status ∈ queued|in_progress|completed|failed.

Figure (§15, async render flow): the CPU control plane owns Redis + the local MP4 store (S3 optional); the Modal renderer only returns MP4 bytes. The renderer never reaches internal state: no callback, no VPC.

Storage — Volume-only by default, S3 optional.

Weights: Modal Volume (provisioned once); no S3 at runtime.
Output MP4: returned to the control plane and written to its local disk, served from there, and regularly cleaned by a TTL job (M9 LocalBlobStore.cleanup()).
Renderer-side cleanup: the renderer caches the MP4 on its own disk only until it's transmitted, then deletes it per-job — so warm GPU containers never accumulate output. Two cleanups: renderer (immediate, per-job) + control plane (TTL on served files).
S3 is optional — add it only if you want (a) offloaded CDN/presigned-URL downloads, or (b) a horizontally-scaled control plane without a shared disk (otherwise local disk / a shared NFS-EFS mount is enough). Single control-plane server → local disk is sufficient.
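The control-plane TTL cleanup can be as small as the sketch below. This is not the M9 LocalBlobStore.cleanup() implementation, which is not shown in this doc; the function name, the *.mp4 naming, and the one-hour TTL are assumptions for illustration.

```python
import os
import tempfile
import time
from pathlib import Path


def cleanup_expired(store_dir, ttl_s, now=None):
    """Delete served MP4s whose mtime is older than ttl_s; return removed names."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(store_dir).glob("*.mp4"):
        if now - path.stat().st_mtime > ttl_s:
            path.unlink(missing_ok=True)   # tolerate a concurrent delete
            removed.append(path.name)
    return sorted(removed)


# Demo in a throwaway directory: one file aged past the TTL, one fresh.
with tempfile.TemporaryDirectory() as demo_dir:
    old, fresh = Path(demo_dir, "old.mp4"), Path(demo_dir, "fresh.mp4")
    old.write_bytes(b"x")
    fresh.write_bytes(b"y")
    two_hours_ago = time.time() - 7200
    os.utime(old, (two_hours_ago, two_hours_ago))
    removed = cleanup_expired(demo_dir, ttl_s=3600)
    remaining = sorted(p.name for p in Path(demo_dir).iterdir())

print(removed, remaining)
```

Keying on mtime means a file is kept for ttl_s after it was written, not after it was last downloaded; if the product needs the latter, track the last access in Redis instead.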

Why function-spawn() and not a web request (C2): a Modal web endpoint caps a request at 150 s [D] — a multi-minute render can't return bytes synchronously over HTTP. spawn() uses the function timeout (up to hours), and the control plane retrieves the result via FunctionCall. The renderer never touches our Redis or storage; large MP4 returns transit Modal's object store transparently.
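A sketch of one non-blocking poll on the control plane, mapping the spawn lifecycle onto the API's coarse statuses. It assumes `call` behaves like a Modal FunctionCall whose get(timeout=0) raises the built-in TimeoutError while the job is still running; poll_render and the fake call class are hypothetical and exist only so the sketch runs without Modal.

```python
import tempfile
from pathlib import Path


def poll_render(call, video_id: str, store_dir: str) -> dict:
    """Poll once; on completion write the MP4 to local disk and report status."""
    try:
        mp4 = call.get(timeout=0)          # returns immediately if not finished
    except TimeoutError:
        return {"id": video_id, "status": "in_progress"}
    except Exception as exc:               # the render itself raised
        return {"id": video_id, "status": "failed", "error": str(exc)}
    out = Path(store_dir) / f"{video_id}.mp4"
    out.write_bytes(mp4)
    return {"id": video_id, "status": "completed", "path": str(out)}


class _FakeCall:
    """Stand-in for a FunctionCall, used only in this demo."""

    def __init__(self, result=None, error=None):
        self.result, self.error = result, error

    def get(self, timeout=None):
        if self.error is not None:
            raise self.error
        return self.result


with tempfile.TemporaryDirectory() as demo_dir:
    pending = poll_render(_FakeCall(error=TimeoutError()), "vid_1", demo_dir)
    failed = poll_render(_FakeCall(error=RuntimeError("GPU OOM")), "vid_2", demo_dir)
    done = poll_render(_FakeCall(result=b"\x00mp4"), "vid_3", demo_dir)
    stored = Path(done["path"]).read_bytes()

print(pending["status"], failed["status"], done["status"])
```

The returned dict is what the control plane would write to Redis under the video_id. TimeoutError is caught before the generic handler because it is itself an Exception subclass.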

Progress granularity (honest): coarse status (queued→in_progress→completed) is free from the spawn lifecycle. Live progress 0–100 would require the renderer to emit progress events to the control plane — an outbound call that reintroduces a renderer→control-plane dependency. So: default to coarse status; only add live progress if the product requires it, and accept the extra channel.

16. Spillover & back-pressure (design)#

The core value prop — and the part most easily hand-waved. Concretely:

Saturation signal (pick one the gateway already emits): in-house queue depth (preferred), in-house GPU utilization, or p99 TTFT. The control plane reads it.
Hysteresis: spill to Modal above a high-watermark; stop spilling below a low-watermark (drain back to in-house). Two thresholds, not one, to avoid flapping.
Cold-start-on-burst (the hard case): we spill because in-house is saturated, but Modal is then cold (min_containers=0) → the first spilled requests eat the worst cold start, and buffer_containers=1 covers only ~one in-flight request. Options, by cost:
1. Accept the slow first burst (cheapest) — fine if the SLA tolerates it.
2. Predictive pre-warm: raise min_containers on a schedule before known peaks.
3. Standing warm pool: min_containers≥1 during expected-burst windows (costs idle GPU).
max_containers hit (budget cap, D10): define overflow-of-overflow behaviour — queue in the control plane (with a timeout) or shed (return 503/failed). Don't leave it undefined.
Drain-back: when in-house recovers, route new jobs back; let Modal scale to zero.
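The two-threshold hysteresis above can be sketched as a small state machine. The class name is hypothetical, and 0.85/0.60 are the illustrative defaults quoted elsewhere in this doc, not measured values.

```python
class SpillGate:
    """Spill above the high watermark; stop only once below the low one."""

    def __init__(self, high: float = 0.85, low: float = 0.60):
        if not 0 <= low < high <= 1:
            raise ValueError("need 0 <= low < high <= 1")
        self.high, self.low = high, low
        self.spilling = False

    def update(self, busy_ratio: float) -> bool:
        if self.spilling:
            if busy_ratio < self.low:      # in-house has recovered: drain back
                self.spilling = False
        elif busy_ratio > self.high:       # in-house is saturated: start spilling
            self.spilling = True
        return self.spilling


gate = SpillGate()
trace = [gate.update(x) for x in (0.50, 0.90, 0.70, 0.86, 0.55, 0.70)]
print(trace)
```

A reading of 0.70 keeps whatever state the gate was already in, which is what prevents flapping between the two tiers.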

16.1 Integrating with the weighted gateway (the full picture)#

The gateway already does weighted scheduling. Keep weighted round-robin within in-house; add Modal as one logical overflow endpoint (you can't address its containers; Modal load-balances internally). Three signals drive it:

Signal	Source
When to spill (in-house saturated)	in-house workers' load (busy ratio / their `max_conns`) — with hysteresis (illustrative defaults `0.85`/`0.60` — tune with real data, not measured) so it doesn't flap
How much Modal can absorb right now	Modal API — `Function.get_current_stats()` → `num_total_runners`, `num_running_inputs`, `backlog`. Live free slots = `runners × max_inputs − running`
Hard ceiling	`max_conns = max_inputs × max_containers` (per D10 budget)
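The capacity arithmetic in the table can be written out directly. The dataclass below is a plain stand-in that mirrors the three stats fields named above; it does not call the Modal API. max_inputs=8 and max_containers=4 are illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalStats:
    """Stand-in for the stats fields named in the table."""
    num_total_runners: int
    num_running_inputs: int
    backlog: int


def free_slots(stats: ModalStats, max_inputs: int) -> int:
    """Live free slots = runners x max_inputs - running (never negative)."""
    return max(0, stats.num_total_runners * max_inputs - stats.num_running_inputs)


def hard_ceiling(max_inputs: int, max_containers: int) -> int:
    """max_conns = max_inputs x max_containers."""
    return max_inputs * max_containers


MAX_INPUTS, MAX_CONTAINERS = 8, 4          # illustrative, not tuned
ceiling = hard_ceiling(MAX_INPUTS, MAX_CONTAINERS)
partly_busy = free_slots(ModalStats(2, 10, 0), MAX_INPUTS)
saturated = free_slots(ModalStats(4, 32, 3), MAX_INPUTS)
print(ceiling, partly_busy, saturated)
```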

"Is Modal full?" Modal never returns a 429 — it queues. You detect it: num_total_runners ≥ max_containers and no free slots (or backlog rising) → set weight 0 / shed. The control plane's own in-flight-vs-cap counter (M8) is the authoritative, instant version; get_current_stats() is the live cross-check that drives the weight.

Two ways to wire it (pick by your LB's features):

Static (simplest, no controller): give each in-house endpoint a max_conns; add Modal as a backup / priority-1 member with max_conns = max_inputs × max_containers. The LB sends to Modal only when all in-house are at capacity, and sheds (503) past Modal's cap. Weights stay exactly as they are for in-house.
Dynamic weight (sidecar — M12): a small process polls get_current_stats() every ~3 s, computes Modal's live free capacity, and sets Modal's weight via the LB admin API (HAProxy Runtime API / nginx Plus API / Envoy EDS): 0 when in-house has room or Modal is full; a small warm-up weight to wake a cold tier; ramps with real capacity as Modal scales; hysteresis + fail-safe (drop to 0 if the API can't be reached). Reference impl: m12_gateway_sidecar.py (validated, code-reviewed).

Net behaviour: in-house carries everything it can (weights unchanged); Modal absorbs only the spillover, sized to its real live capacity; past the cap you return a clean "busy, retry" instead of letting requests rot in a queue.
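One way to turn those rules into a weight is sketched below. This is not the m12_gateway_sidecar.py reference implementation; the function names, the 0–100 weight scale, and the warm-up weight of 1 are assumptions. The fullness test reads the rule as "at max_containers AND (no free slots OR backlog rising)".

```python
def is_full(runners: int, free: int, backlog: int, prev_backlog: int,
            max_containers: int) -> bool:
    """At the container cap and either out of slots or with a growing backlog."""
    at_cap = runners >= max_containers
    return at_cap and (free == 0 or backlog > prev_backlog)


def modal_weight(spilling: bool, stats_ok: bool, runners: int, free: int,
                 full: bool, ceiling: int, max_weight: int = 100,
                 warmup_weight: int = 1) -> int:
    if not stats_ok:                       # fail-safe: stats API unreachable
        return 0
    if not spilling or full:               # in-house has room, or Modal is full
        return 0
    if runners == 0:                       # cold tier: trickle traffic to wake it
        return warmup_weight
    share = min(1.0, free / ceiling)       # ramp with real live capacity
    return max(warmup_weight, round(max_weight * share))


CEILING = 32                               # illustrative max_inputs x max_containers
weights = {
    "in-house has room": modal_weight(False, True, 2, 16, False, CEILING),
    "stats unreachable": modal_weight(True, False, 2, 16, False, CEILING),
    "cold tier": modal_weight(True, True, 0, 0, False, CEILING),
    "half free": modal_weight(True, True, 4, 16, False, CEILING),
    "full": modal_weight(True, True, 4, 0, is_full(4, 0, 3, 1, 4), CEILING),
}
print(weights)
```

Keeping the weight at the warm-up value when containers are running but momentarily out of slots (and not yet at the cap) leaves a small flow in place so Modal keeps scaling out.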

17. Monitoring & machine health#

Do we manage Modal machine health? Almost entirely no; that's the point of serverless. Modal runs continuous GPU health checks across its fleet, drains/replaces bad GPUs, and reschedules containers on failure (gpu-health). No DCGM / node-problem-detector / nvidia-smi babysitting; no nodes to patch. Our D5 "respawn on failure" is largely what Modal already does.

What we DO monitor — at the function/app level:

Layer	Watch	Where
Modal-side	GPU memory (OOM!), GPU util/power, container counts, cold-start time, failure/retry rate	Modal dashboard + GPU metrics; abnormal GPU events go to container logs
Control plane	render success/fail rate, render latency, queue depth / spillover rate, cold-start p99	our metrics + Redis (already tracks jobs)
Cost	$/day vs the $100k/mo budget	Modal spend limit + billing alerts

Modal GPU metrics are container-level aggregated (can't isolate one bad GPU among 8) — irrelevant for us (1 GPU/container).
Given H2 (GPU-OOM risk on a marginal card), GPU memory utilization is the one Modal-side metric to alert on — an OOM is the most likely silent failure.
Export Modal logs/metrics to Datadog/Grafana if you want a unified view; otherwise the dashboard + control-plane SLIs suffice.

Bottom line: don't build machine-health monitoring. Put observability in the control plane (render success/latency/spillover/cost) plus a GPU-OOM + failure-rate alert on Modal.

18. Prewarm & readiness gate#

2026-07-07 update — measured outcomes on the real workload (v1.0.0): the effective prewarm mechanism in production is the self-warmup readiness gate (fast-boot + one private render before the port opens → 405 s cold→traffic-ready, $0 idle; milestones/elastic_avatar.py). Memory snapshots (§18.4–18.5) did NOT survive contact with the real image (creation >2.6 h, never restored) and compile-cache Volumes don't work for this build (confirmed twice). Sub-minute absorption = warm replica (min_containers=1). Full data: elastic.html.

API reference: the renderer exposes 74 production routes (/v1/videos, /avatar/video/*, /wan_s2v/*, …). Full call examples with real captured responses: API_EXAMPLES.md · surface + config notes: STREAMING_FINDINGS.md.

Prewarm isn't one script — it's a stack of readiness layers, each with its own lifecycle and fingerprint, provisioned once and reused. The provisioner is an idempotent readiness gate the control plane consults before routing to Modal.

S3 elimination (validated 2026-07-01). The renderer needs no S3 at runtime — S3 was only a one-time weight-provisioning source. The clean end state bakes the DiffSynth upstream weights into a derived image (run_function copies them into /workspace/models/… at build time), so the serving container mounts no Volume, wires no symlinks, and touches no S3 — it just reads its own filesystem. Verified: healthz ✅ + a real render on the baked image with zero external deps (wan_baked.py). This collapses L1 into L0 (bake at build). (The boson image maintainer should ideally bake these into the published image directly.)

18.1 Readiness layers#

Layer	What	Where	Keyed by	Established
L0 Image	code + CUDA + engine + baked generators	Modal image store	image tag	build once (Modal caches)
L1 Weights	upstream files (T5/VAE/wav2vec/umt5)	Volume (or baked)	model version	provision from S3 once
L2 Compile cache	Inductor/Triton kernels	Volume (or baked)	image + GPU type + torch	warm once — but see 18.4
L3 Snapshot	warm+compiled server memory	Modal memory snapshot	image + GPU + model	capture once
L4 Warm pool	running containers	`min/buffer_containers`	—	continuous (idle $)
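One way to turn the "Keyed by" column into per-layer fingerprints is sketched below. The Spec fields and helper names are illustrative. L0 to L2 follow the table; adding torch_ver to the L3 key is an assumption made here so that a torch change also invalidates the snapshot, in line with the invalidation rule stated in 18.2. L4 has no key and is omitted.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    image_tag: str
    model_ver: str
    gpu_type: str
    torch_ver: str


# Which spec fields key each layer (torch_ver on L3 is an assumption).
LAYER_KEYS = {
    "L0": ("image_tag",),
    "L1": ("model_ver",),
    "L2": ("image_tag", "gpu_type", "torch_ver"),
    "L3": ("image_tag", "gpu_type", "model_ver", "torch_ver"),
}


def layer_fingerprint(spec: Spec, layer: str) -> str:
    """Stable short hash of only the fields that key this layer."""
    parts = [f"{name}={getattr(spec, name)}" for name in LAYER_KEYS[layer]]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


def invalidated_layers(old: Spec, new: Spec) -> list[str]:
    """Layers whose fingerprint differs between two specs."""
    return [
        layer
        for layer in LAYER_KEYS
        if layer_fingerprint(old, layer) != layer_fingerprint(new, layer)
    ]
```

Hashing only the keyed fields means an unrelated change (for example a new model version) leaves the L0 and L2 fingerprints untouched, so those layers are reused rather than rebuilt.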

18.2 The provisioner = idempotent readiness gate#

ensure_ready(spec) -> {ready: bool, missing: [...], eta_s: N}
    fp = fingerprint(image_tag, model_ver, gpu_type, torch_ver)
    missing = []
    if not weights_ready(model_ver): spawn provision_weights(); missing += [L1]   # async
    if not warm_ready(fp):           spawn warm(fp);            missing += [L3]   # async
    return {ready: True} if not missing else {ready: False, missing, eta_s}

Idempotent: once warm, re-calling is an instant no-op (each layer has a sentinel in its Volume / a modal.Dict). A fingerprint change invalidates only the affected layers (new image → L0/L2/L3; new model → L1/L3; new GPU/torch → L2/L3).
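The pseudocode above can be made concrete roughly as follows. This is a sketch under stated assumptions: the sentinel store is any dict-like object (hypothetically a modal.Dict), the "ready"/"pending" states and key names are invented for illustration, and `spawn` stands in for whatever starts the async job.

```python
from typing import Callable


def ensure_ready(
    sentinels: dict,
    fp: str,
    model_ver: str,
    spawn: Callable[[str], None],
    eta_s: int = 0,
) -> dict:
    """Idempotent readiness check over a sentinel store.

    A key maps to "ready" once a layer is established and to "pending"
    while its job is in flight. `spawn` starts the named job asynchronously.
    """
    wanted = {
        f"weights:{model_ver}": "provision_weights",  # L1
        f"warm:{fp}": "warm",                         # L3
    }
    missing = []
    for key, job in wanted.items():
        state = sentinels.get(key)
        if state == "ready":
            continue
        missing.append(key)
        if state != "pending":       # do not double-spawn in-flight work
            sentinels[key] = "pending"
            spawn(job)
    if not missing:
        return {"ready": True}
    return {"ready": False, "missing": missing, "eta_s": eta_s}
```

The "pending" state is what keeps repeated polling cheap: a second call while jobs are running reports the same missing layers without spawning anything new, and a call after both sentinels flip to "ready" returns immediately.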

18.3 The e2e journey (gateway → provisioner)#

gateway: burst predicted / in-house saturating
  → control-plane.ensure_ready(model=v, gpu=H100)
       provisioner checks the manifest; warms only the missing layers (async)
       → {ready:false, missing:[...], eta:~Nm}
  → gateway: KEEP traffic in-house; poll ensure_ready()      ← readiness is a GATE
  → when {ready:true}: route overflow to Modal

Rule: never spill to a cold Modal tier for latency-sensitive traffic — route only after the gate is green (or accept the cold start for pure best-effort overflow). Trigger order, best→worst: bake at image-build > predictive (scheduled/util-watermark) > lazy (on first spill).
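The gate rule can be sketched from the gateway's side as a polling loop. All names, the poll interval, and the timeout are assumptions for illustration; the real gateway integration is described in §16.1.

```python
import time
from typing import Callable


def route_overflow_when_ready(
    ensure_ready: Callable[[], dict],
    route_in_house: Callable[[], None],
    route_to_modal: Callable[[], None],
    poll_interval_s: float = 5.0,
    timeout_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Keep traffic in-house until the gate is green, then spill.

    Returns True if overflow was routed to Modal, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if ensure_ready().get("ready"):
            route_to_modal()
            return True
        route_in_house()             # gate not green: do not spill
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

For pure best-effort overflow the same loop could be bypassed entirely, accepting the cold start; the sketch covers only the latency-sensitive path.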

18.4 Measured reality for the wan-s2v server (2026-07-01, H100) — L2 is a no-op here#

Cold start ≈ 130 s to /health (model load + startup warmup) — dominated by model load, not compile.
The server's torch.compile (max-autotune) does not produce a meaningful persistent disk cache — TORCHINDUCTOR_CACHE_DIR/TRITON_CACHE_DIR stayed 0 MB, /root/.cache ≈ 1.2 MB after a full warmup. So persisting a compile cache to a Volume does not help this server.
Config fix regardless: set WAN_SGLANG_WARMUP=1 so the compile runs cleanly at startup (the default, 0, defers it to the first render, which caused the earlier web-endpoint 303 flakiness).
So the effective prewarm lever here is L3 (memory snapshot) + L4 (warm pool), not L2. L2 is genuinely empty for this server — a good example of why the readiness gate is per-layer, per-server: measure each layer; don't assume a compile cache exists.
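The "measure each layer" advice can be applied with a small check like the one below, run inside the container after a full warmup. It is a sketch: the helper names and the `min_useful_mb` cutoff are assumptions, and only the two environment variables come from the measurement above.

```python
import os
from pathlib import Path


def dir_size_mb(path: str) -> float:
    """Total size of regular files under `path`, in MB (0.0 if absent)."""
    root = Path(path)
    if not root.is_dir():
        return 0.0
    total = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total / 1e6


def compile_cache_report(min_useful_mb: float = 50.0) -> dict:
    """Sizes of the Inductor/Triton cache dirs named by the environment.

    `min_useful_mb` is a hypothetical cutoff for "worth persisting".
    """
    report = {}
    for var in ("TORCHINDUCTOR_CACHE_DIR", "TRITON_CACHE_DIR"):
        path = os.environ.get(var)
        size = dir_size_mb(path) if path else 0.0
        report[var] = {"path": path, "mb": size, "useful": size >= min_useful_mb}
    return report
```

A report showing both directories near 0 MB after warmup, as observed here, is the signal that persisting L2 to a Volume would buy nothing for that server.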

18.5 Measured cold-start comparison — L3 memory snapshot (2026-07-01, H100, GPU snapshot alpha)#

Three cold starts of a snapshot-enabled modal.Cls (server booted under @modal.enter(snap=True) with TORCHINDUCTOR_COMPILE_THREADS=1), scaled to zero between calls:

Cold start	client-side time	note
Baseline (no snapshot)	~130–152 s	model load + warmup
Snapshot restore (hit)	56.8 s	~2.5× faster — restore works
Snapshot creation (first boot)	319.6 s	one-time overhead per worker type
Re-bake (unmatched worker)	1213 s	⚠️ landed on a worker without a matching snapshot → full re-create, very slow

Verdict: the memory snapshot cuts cold start roughly 2.3–2.7× when restore hits (56.8 s vs ~130–152 s), but GPU snapshots are alpha, worker-type-specific, and inconsistent: some cold starts re-bake (1213 s, about 20 min, here). So L3 alone is not reliable for this server yet. The deterministic prewarm is L4 (warm pool): min_containers=1 (or buffer_containers) keeps a container hot, so the served path sees no cold start, at the cost of one idle H100. Recommended posture:

Latency-critical / known-burst windows: min_containers≥1 (warm pool) — deterministic.
Best-effort overflow: accept the ~130 s cold start, or enable snapshots and tolerate the occasional re-bake, until GPU snapshots leave alpha.
Always set WAN_SGLANG_WARMUP=1 so the compile happens cleanly at startup regardless.
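A rough way to see why occasional re-bakes dominate is to compute the mean cold start as a function of the snapshot hit rate. The sketch below uses the measured figures from the table (56.8 s restore, 1213 s re-bake, ~130 s baseline) and assumes, pessimistically and from a single sample, that every miss costs a full re-bake.

```python
def expected_cold_start_s(hit_rate: float, restore_s: float, miss_s: float) -> float:
    """Mean cold start when a fraction `hit_rate` of starts restore a snapshot."""
    return hit_rate * restore_s + (1.0 - hit_rate) * miss_s


def break_even_hit_rate(baseline_s: float, restore_s: float, miss_s: float) -> float:
    """Hit rate at which snapshots match the no-snapshot baseline on average."""
    return (miss_s - baseline_s) / (miss_s - restore_s)


RESTORE_S = 56.8    # measured snapshot restore (hit)
REBAKE_S = 1213.0   # measured re-bake on an unmatched worker (single sample)
BASELINE_S = 130.0  # lower end of the measured no-snapshot baseline

p_star = break_even_hit_rate(BASELINE_S, RESTORE_S, REBAKE_S)
print(f"break-even hit rate: {p_star:.3f}")
print(f"mean cold start at 90% hits: "
      f"{expected_cold_start_s(0.9, RESTORE_S, REBAKE_S):.1f} s")
```

Under that assumption the break-even hit rate is about 0.94, and at a 90% hit rate the mean cold start (about 172 s) is already worse than the ~130 s baseline. That is consistent with treating snapshots as a best-effort optimisation and the warm pool as the deterministic lever.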

Appendix — Provenance of figures#

Every number is tagged so none is taken on faith:

[M-platform] — Modal platform behaviour we observed this session using a synthetic stand-in container (a cheap debian_slim CPU container with a time.sleep() as fake work). It validates Modal's scheduler/autoscaler, not our real image, model, GPU, or render.

[D] — Modal published docs. · [T] — team-provided. · [E] — estimate (illustrative).

⚠️ [M-platform] figures are not real-workload numbers: each one reflects Modal mechanics with a stand-in, never the renderer. Cold-start time, per-GPU render concurrency, throughput, GPU-OOM behaviour, snapshot speedup, and $/render can only come from deploying the real 75 GB image on a GPU; where this document reports such measurements (for example §18.4–18.5), they are dated and marked as measured.

Tag inventory:

[M-platform]: scale 0→12→0, 15-req→9-container fan-out, region pin us-east→us-east-2, 0.92 GB/s Volume read (3 GB file, CPU box), token-only spawn()→get() auth round-trip (+ bad-token rejected).

[T]: 75 GB image, ~40 GB VRAM, FP8, 16k context, $100k/mo budget.

[D]: "<1s" container start, 3–10× snapshot speedup, 1.5×/1.75× region premium, 1–2 GB/s Volume, ≤5 proxy IPs, 150s web timeout.

[E]: §2 K8s-vs-Modal cold-start ranges, SGLang init time, the ~43s weight-load extrapolation.
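The derivation of the ~43 s weight-load estimate is not given in this appendix. As a consistency check only, a figure of that size falls out of dividing a payload by a read rate, if one assumes the payload is about 40 GB (borrowing the [T] VRAM figure as a proxy for weight size, which is an assumption) and uses the [M-platform] 0.92 GB/s Volume read:

```python
def read_time_s(size_gb: float, throughput_gb_per_s: float) -> float:
    """Seconds to read `size_gb` at a sustained `throughput_gb_per_s`."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


ASSUMED_PAYLOAD_GB = 40.0   # assumption: payload roughly equals the [T] VRAM figure
MEASURED_GB_PER_S = 0.92    # [M-platform] Volume read on a CPU box

print(f"{read_time_s(ASSUMED_PAYLOAD_GB, MEASURED_GB_PER_S):.1f} s")
```

This stays an [E] figure: the read rate was measured on a generic 3 GB file from a CPU container, not on the real weights.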

14. Reference skeleton — `renderer.py` (illustrative, NOT tested)#