Modal Burst-Overflow Scalability Plan

Owner: erik@boson.ai · Last updated: 2026-06-16 · Status: Implementation-ready (all §10 decisions locked)

Provenance of figures. Numbers are tagged by source so none are taken on faith:
[M] measured by us on Modal this session: 0.92 GB/s volume read, 0→12→0 autoscale, 15-req→9-container fan-out, region pin us-east→us-east-2. Figures computed from these are tagged [M-derived].
[D] Modal published docs: "<1s" start, 3–10× snapshot speedup, 1.5×/1.75× region premium, 1–2 GB/s volume, ≤5 proxy IPs.
[T] team-provided: 75 GB image, ~40 GB VRAM, FP8, 16k context, $100k/mo budget.
[E] estimate (illustrative, not measured; flagged inline): the §2 K8s-vs-Modal cold-start ranges and the SGLang init time, directional only.

1. Goal

Use Modal as a burst-overflow inference tier, not the primary serving layer.

In-house GPU cluster handles stable/baseline traffic.
Modal absorbs spikes only, then scales to zero so we pay nothing when idle.
Model is API-compatible (OpenAI schema via SGLang) so our gateway can treat in-house and Modal as interchangeable backends.

Non-goals: running 24/7 max-throughput on Modal (in-house is cheaper for constant load); training (this doc is inference-serving only).

2. What Modal actually is (architecture)

Modal is not Kubernetes. It is a custom serverless GPU platform:

Layer	Implementation
Orchestration	Custom multi-cloud scheduler in Rust (deliberately not Docker/K8s)
Container runtime	Custom Rust runtime, sub-second container start
Isolation	gVisor (multi-tenant sandbox; first to run gVisor w/ GPUs)
Filesystem	ImageFS — lazy-loading, content-addressed, FUSE. Images stream on demand
Hardware	Bare-metal GPU instances pooled across clouds (e.g. Oracle OCI bare metal)

Implications for us:

No nodes / pods / cluster / HPA to manage. Scaling = Modal primitives, not K8s objects.
Lazy content-addressed FS → a 75 GB image does not fully download on boot; only files actually read are fetched + cached. Code cold-start is fast; weight loading is the real cost.
gVisor adds a thin syscall-compat layer — negligible for SGLang inference.

Why container start is "<1s" (and what it excludes). The <1s is the runtime bringing up the container process on an already-warm bare-metal machine. Modal hits it by deleting the three things that make a "normal" cold start slow:

Normal slow step	Typical time [E]	Modal's trick
Provision a VM / add a K8s node	30 s – 5 min	Warm bare-metal pool, GPUs pre-attached → ~0
`docker pull` (extract all layers first)	30 s to several min (multi-GB GPU images)	ImageFS lazy mount: start now, stream files on demand
Container runtime setup	0.1 – 1 s	Rust runtime + gVisor, optimized hot path

The time ranges above and the 2–10 min K8s figure below are [E] typical industry estimates, illustrative only — not measured by us, not published by Modal. They convey the direction (Modal removes the node-provision and image-pull waits), not benchmarked values.

The "<1s" Modal start [D] excludes loading our model weights. A GPU pod scaling from zero on K8s is typically 2–10 min [E] end-to-end; Modal collapses that to seconds — but our service cold start is still gated by weight load + SGLang init (see §5).

3. Our workload facts

Fact	Value
Image size (weights + SGLang + CUDA)	~75 GB
Model	Closed-source, FP8 (no HF id; weights from our private store)
Runtime VRAM	~40 GB (FP8 model + 16k KV cache) → fits on one GPU
Serving engine	SGLang
GPU	Any single FP8-capable GPU with enough VRAM (H100 / H200 / L40S …) — take whatever Modal has (D3)
GPU count	1 (keeps us on the supported single-GPU snapshot path)

4. Target architecture

Figure: §4 target architecture (CPU control plane in front of interchangeable, stateless GPU renderers).

A pure-CPU control-plane server (co-located with the gateway, inside the internal network with access to internal Redis and public S3) owns API standardization, job status, the MP4 upload, and the spillover decision. The GPU renderer — in-house or Modal — is interchangeable, stateless compute behind it.

4.1 The endpoint: one URL, load-balanced across replicas (verified)

A Modal web endpoint is one stable URL, not a per-machine address:

Figure: §4.1 one stable URL, load-balanced across replicas (15 requests → 9 containers across AWS + Azure + GCP).

URL is deterministic & stable: https://<workspace>--<app>-<function>.modal.run; unchanged across redeploys. Our gateway points at this single URL.
Verified fan-out: 15 concurrent requests to one URL were served by 9 distinct containers spanning AWS + Azure + GCP (no region pin → full multi-cloud pool).
Per-instance URLs? No. Individual containers are not addressable by default — there is no container-1.modal.run. The only way to reach a specific container is modal.forward() / Tunnels (an advanced feature for direct port exposure, e.g. distributed training) — not used here.
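
A minimal sketch of how the control plane could derive the endpoint URL from the pattern above. The helper name and the app/function labels are hypothetical, and Modal may normalize or shorten labels, so the URL printed by modal deploy should be treated as authoritative.

```python
def modal_endpoint_url(workspace: str, app: str, function: str) -> str:
    """Format a URL following the <workspace>--<app>-<function>.modal.run pattern."""
    for part in (workspace, app, function):
        if not part:
            raise ValueError("workspace, app and function must be non-empty")
    return f"https://{workspace}--{app}-{function}.modal.run"


# "example-app" and "serve" are placeholder labels, not our deployed names.
endpoint = modal_endpoint_url("bosonai", "example-app", "serve")
print(endpoint)
```

Because the URL does not change across redeploys, the control plane can keep it in static configuration rather than discovering it at runtime.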

4.2 Where load-balancing behavior is defined

Modal owns the routing; you own the knobs — there is no custom routing/LB config to write.

Concern	Who controls it	How
Request → container routing	Modal proxy (automatic)	Routes to any container with spare capacity. No rules to define.
How many requests per container before fanning out	You	`@modal.concurrent(target_inputs=…, max_inputs=…)` — `target_inputs` = autoscaler goal, `max_inputs` = hard ceiling (containers burst to this if scale-up lags)
How many containers exist	You	`min_containers`, `buffer_containers`, `max_containers`, `scaledown_window`
Session affinity / sticky routing	Not available in Modal	Routing is stateless. If we ever need affinity or custom LB, it goes in our gateway in front of Modal, or via separate named apps.
Region of placement	You (optional)	`region=` (costs 1.5–1.75× the base rate; default = unpinned multi-cloud pool)

For our stateless OpenAI-style inference, the default stateless LB is exactly right — we only tune @modal.concurrent (throughput/$ per replica) and the autoscaler bounds.

4.3 Scale up/down (measured 2026-06-29)

Tested with min_containers=0, scaledown_window=20, @modal.concurrent(max_inputs=1), driving 12 concurrent requests at the single URL:

Phase	Containers
At rest	0 (scaled to zero, $0)
Under load (12 concurrent)	12 (one per request, `max_inputs=1`)
Idle, within `scaledown_window`	12 (held ~20–24s after last request)
Idle +32s → +40s	12 → 1 → 0

Confirms the overflow economics: replicas exist only during the burst + a short cooldown, then drain to zero. scaledown_window tunes how long warm replicas linger (latency vs cost).
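
The container counts in the table follow from simple packing arithmetic. The sketch below is an idealized model, assuming the autoscaler fills each container up to max_inputs; the real autoscaler aims for target_inputs and can lag, so this is an upper-level estimate, not Modal's algorithm. The function name is hypothetical.

```python
import math
from typing import Optional


def containers_needed(concurrent_requests: int, max_inputs: int,
                      max_containers: Optional[int] = None) -> int:
    """Idealized replica count: requests packed max_inputs per container."""
    if max_inputs < 1:
        raise ValueError("max_inputs must be >= 1")
    needed = math.ceil(concurrent_requests / max_inputs)
    # max_containers, when set, is a hard ceiling on the fleet size.
    return needed if max_containers is None else min(needed, max_containers)


measured_case = containers_needed(12, max_inputs=1)    # the test setup above
packed_case = containers_needed(12, max_inputs=32)     # the §14 skeleton setting
capped_case = containers_needed(12, max_inputs=1, max_containers=8)
print(measured_case, packed_case, capped_case)  # 12 1 8
```

With max_inputs=1 the model reproduces the measured 12 containers; with batching enabled the same burst fits in one replica.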

4.4 Governing principle: control plane vs render plane

Modal is a stateless render plane (compute only). The upper layer — a pure-CPU control-plane server co-located with the gateway — owns all state. Modal scales the renderer; status, lifecycle, Redis, and S3 stay above it.

Concern	Owner
API standardization (`POST /v1/videos`, `GET /{id}`, `GET /{id}/content`)	CPU control-plane server
Job lifecycle, `video_id`, status/progress	CPU control-plane server → Redis (internal)
Render dispatch + spillover (in-house vs Modal)	CPU control-plane server
MP4 upload to S3 (+ presigned reads)	CPU control-plane server (public S3 access)
GPU rendering (frames → MP4 bytes)	Renderer (in-house / Modal) — pure compute, stateless

Because the renderer holds no state, the hard distributed-systems questions disappear: "which container has the mp4?" (none — the CPU control plane uploads it to S3 by video_id), sticky sessions (N/A — fire-and-forget), keeping replicas warm / region pinning (nothing to keep). The renderer touches nothing internal — it just returns the MP4 bytes to the CPU control plane, which (already inside the internal network) owns Redis and S3. So there is no HTTPS callback, no egress proxy, and no VPC on the Modal side.

5. Cold-start strategy (the core problem)

Anatomy of our cold start (what we're actually fighting)

container start (<1s [D])  +  weight load (~43s [M-derived])  +  SGLang init/compile (10s to minutes [E])

The <1s [D] is Modal's floor (warm pool + lazy FS). We don't optimize this.
The ~43s [M-derived] weight load comes from our measured 0.92 GB/s applied to ~40 GB (see below); the init time is [E], an estimate to confirm at bring-up. The snapshot captures the warm server, so a restored cold start skips both and drops back toward the few-seconds range.

Scale-to-zero means the cold start lands exactly when a burst hits. Three layers fix it:

1. Split the 75 GB image
   - Image = SGLang + CUDA only (few GB).
   - Weights → Modal Volume (boson-weights in production; hf-cache held the test weights), pinned to an exact revision.
   - Pre-compiled kernels (DeepGEMM) → a second Volume, built at image-build time.
2. Memory snapshot of the warm SGLang server (officially supported; 3–10× faster [D])
   - Single GPU → enable_memory_snapshot=True + experimental_options={"enable_gpu_snapshot": True}.
   - @modal.enter(snap=True): start SGLang, warm it, /release_memory_occupation, then snapshot.
   - @modal.enter(snap=False): /resume_memory_occupation on restore.
   - Env: TORCHINDUCTOR_COMPILE_THREADS=1 (else torch.compile breaks snapshots).
   - Note: snapshot "warms up" over the first ~5 cold starts, not instantly on day one.
3. Burst headroom
   - min_containers=0 (pay nothing idle) + buffer_containers=1 (first spike request doesn't eat a full cold start).
   - Optional predictive pre-warm: bump min_containers on a schedule before known peaks.
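
The optional pre-warm can be reduced to a lookup from time of day to a min_containers value. This is a sketch only: the peak windows are invented for illustration, the function name is hypothetical, and pushing the value to the deployed function (through whatever autoscaler-update call the installed Modal SDK offers) is deliberately left out.

```python
from typing import List, Tuple

# (start_hour, end_hour, min_containers); hours are UTC, window is [start, end).
# Hypothetical windows, not derived from real traffic data.
PEAK_WINDOWS: List[Tuple[int, int, int]] = [(13, 17, 2), (20, 22, 1)]


def min_containers_for(hour_utc: int,
                       windows: List[Tuple[int, int, int]] = PEAK_WINDOWS) -> int:
    """Return the warm floor for this hour; 0 outside every peak window."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be in 0..23")
    for start, end, warm in windows:
        if start <= hour_utc < end:
            return warm
    return 0  # the default from D6: pay nothing while idle


schedule = {hour: min_containers_for(hour) for hour in (9, 13, 16, 17, 21)}
print(schedule)  # {9: 0, 13: 2, 16: 2, 17: 0, 21: 1}
```

A scheduled job could evaluate this once per hour and apply the result, so the warm floor returns to zero automatically after each peak.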

Measured Volume load speed (benchmarked 2026-06-16)

[M] Cold read off hf-cache: 0.92 GB/s (3.09 GB in 3.35 s, single CPU box, single file).
[M-derived] Extrapolated to 40 GB: ~43 s worst-case storage read.
This ~43 s is the one-time cost (snapshot build + first few boots); the snapshot skips it on every subsequent burst. Real serving box (modern GPU + sharded parallel load) should beat this, toward Modal's advertised 1–2 GB/s [D].
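
The extrapolation above is plain division, reproduced here so the figure can be rechecked when the benchmark is rerun. The inputs are the measured values from this section and Modal's advertised range; the function name is hypothetical.

```python
def load_seconds(size_gb: float, throughput_gb_s: float) -> float:
    """Time to read size_gb from storage at a sustained throughput."""
    if throughput_gb_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_s


measured_rate = 3.09 / 3.35                 # GB/s, from the cold-read benchmark
worst_case = load_seconds(40, 0.92)         # the ~43 s figure
advertised = [load_seconds(40, rate) for rate in (1.0, 2.0)]

print(round(measured_rate, 2))              # 0.92
print(round(worst_case, 1))                 # 43.5
print(advertised)                           # [40.0, 20.0]
```

At Modal's advertised 1–2 GB/s the same read would take 20 to 40 s, which bounds how much a faster serving box can save before SGLang init dominates.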

6. Security & networking ("proxy")

Two distinct concerns:

Inbound: protect the endpoint (REQUIRED)

A public GPU endpoint that cold-starts on request is a cost/DoS risk.

Use Proxy Auth Tokens: requires_proxy_auth=True on the web endpoint (D7).
The CPU control-plane server is the only caller — it holds the Modal-Key + Modal-Secret and sends them on every render request.
Modal rejects unauthorized requests at its edge — no container spins up → junk traffic is free. Strictly better than app-level --api-key (which rejects only after boot).
Keep the SGLang --api-key as defense-in-depth.
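
A sketch of how the control-plane server might attach the proxy-auth headers to a render request. It only constructs the request and does not send it. The URL, path, and token values are placeholders, and the helper name is hypothetical.

```python
import json
import urllib.request


def build_render_request(url: str, token_id: str, token_secret: str,
                         payload: dict) -> urllib.request.Request:
    """Build a POST carrying the Modal proxy-auth headers."""
    headers = {
        "Modal-Key": token_id,          # proxy-auth token ID
        "Modal-Secret": token_secret,   # proxy-auth token secret
        "Content-Type": "application/json",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder endpoint and credentials; real values come from the secret store.
req = build_render_request(
    "https://example.modal.run/v1/chat/completions",
    token_id="wk-placeholder",
    token_secret="ws-placeholder",
    payload={"messages": []},
)
print(req.get_method(), req.full_url)
```

Keeping header construction in one helper makes it harder for a code path to call the endpoint without credentials, which would be rejected at Modal's edge.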

Outbound: static egress IP (ONLY IF NEEDED)

Needed only if weights/data live behind our firewall (private registry, internal S3, DB).
modal.Proxy (WireGuard, static IP to allowlist). Team/Enterprise plan, ≤5 IPs.
Skip if weights come from Hugging Face or IAM-keyed public S3.

7. Autoscaling knobs (Modal primitives, not K8s)

Knob	Setting	Purpose
`min_containers`	`0` (or small for hot peaks)	Pay nothing idle
`buffer_containers`	`1–2`	Absorb first spike without cold start
`max_containers`	set a ceiling	Cost cap / blast-radius limit
`@modal.concurrent(max_inputs=N)`	tuned	Requests packed per replica (SGLang batches → $/token lever)
`scaledown_window`	e.g. 300 s	How long idle replicas linger before →0
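
One way to keep these knobs consistent across dev and prod is to hold them in a small validated structure and expand it into the decorator arguments. This is a sketch under assumptions: the class name is hypothetical, and the max_containers default is a placeholder, not the budget-derived value from D10.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AutoscalerKnobs:
    min_containers: int = 0        # D6: pay nothing while idle
    buffer_containers: int = 1     # D6: headroom for the first spike
    max_containers: int = 10       # placeholder ceiling, size from the budget
    scaledown_window: int = 300    # seconds idle replicas linger

    def __post_init__(self) -> None:
        if not 0 <= self.min_containers <= self.max_containers:
            raise ValueError("need 0 <= min_containers <= max_containers")
        if self.buffer_containers < 0:
            raise ValueError("buffer_containers must be >= 0")
        if self.scaledown_window <= 0:
            raise ValueError("scaledown_window must be positive")


knobs = AutoscalerKnobs()
cls_kwargs = asdict(knobs)   # could be passed as @app.cls(**cls_kwargs, ...)
print(cls_kwargs)
```

Validating at construction time catches an inverted floor and ceiling before deploy instead of at the first burst.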

8. Cost model

Per-second GPU billing. Idle = $0 (scale-to-zero).
Burst cost ≈ (GPU $/s) × (replicas) × (burst duration). max_containers caps it.
Budget: $100k / month (D10) — enforce via Modal workspace spend limit + billing alerts; size max_containers so worst-case sustained burst stays within budget.
Cheaper than a second always-on in-house cluster for spiky overflow; not for constant load.
Region pinning premium [D] (Modal pricing docs, not measured by us): broad region 1.5×, narrow region 1.75× base cost. Default = no pin. What we did verify live [M]: region= works — region="us-east" → landed us-east-2 (AWS).
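
The burst-cost formula and the max_containers sizing rule can be written out directly. The GPU rate below is a made-up placeholder, not a Modal price; substitute the current per-second rate for the GPU types in D3. Function names are hypothetical.

```python
import math

GPU_USD_PER_HOUR = 4.00   # placeholder rate for illustration only


def burst_cost_usd(gpu_usd_per_s: float, replicas: int, duration_s: float) -> float:
    """Burst cost = (GPU $/s) x (replicas) x (burst duration)."""
    return gpu_usd_per_s * replicas * duration_s


def max_containers_for_budget(monthly_budget_usd: float, gpu_usd_per_hour: float,
                              sustained_hours: float = 720.0) -> int:
    """Largest fleet that stays within budget if it ran for sustained_hours."""
    return math.floor(monthly_budget_usd / (gpu_usd_per_hour * sustained_hours))


ten_minute_burst = burst_cost_usd(GPU_USD_PER_HOUR / 3600, replicas=12, duration_s=600)
ceiling = max_containers_for_budget(100_000, GPU_USD_PER_HOUR)

print(round(ten_minute_burst, 2))   # 8.0
print(ceiling)                      # 34
```

The 720-hour default models the worst case of a burst that never ends for 30 days; a region pin would multiply the rate by 1.5 or 1.75 and shrink the ceiling accordingly.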

9. Go-live checklist

Must-have

Weights provisioning job → boson-weights Volume, pinned (closed-source, private store)
Pre-compiled DeepGEMM kernels → second Volume
SGLang serving modal.Cls: any single FP8-capable GPU, GPU snapshot, release/resume, respawn-on-fail
requires_proxy_auth=True + proxy-auth tokens issued to the control-plane server
Secrets: private weight-store creds (provisioning job), SGLang api-key, proxy-auth token
max_containers sized to the $100k/mo budget + Modal spend limit/alerts
Health/readiness probe (post snapshot-resume) for the control server
Control-server spillover: saturation signal + hysteresis + drain-back; forward overflow to the Modal URL

Nice / depends

dev + prod Modal environments; modal deploy in CI; pinned image + model revisions
Observability: export logs/metrics; track spillover rate, cold-start p99, $/burst
Region pinning (latency / data residency)
Static-IP modal.Proxy (only if weights/data behind firewall)

10. Decisions (locked)

All decisions resolved with the team — values below are final.

#	Decision	Value	Notes
D1	Model source	Closed-source — no HF id	Private weights provisioned once into a Modal Volume from our own store (pinned). Store creds are a Secret used only by the provisioning job; the renderer reads weights from the Volume.
D2	Quantization	FP8 (done)	Model already FP8. Constrains GPU choice to FP8-capable hardware (D3).
D3	GPU	Any single FP8-capable GPU with enough VRAM — no fixed type	Take whatever Modal has via a preference list (e.g. `gpu=["H100","H200","L40S"]`), ordered by availability/cost; exclude non-FP8 parts (e.g. A100). Still 1 GPU (keeps single-GPU snapshot path). VRAM floor = FP8 model + 16k KV cache (≈ the ~40 GB runtime + headroom). Caveat: GPU snapshots are hardware-specific → expect one snapshot baked per GPU type in the pool (validate; narrow the list if it hurts).
D4	Context length	16k (`--context-length 16384`)	Size KV cache for 16k; included in the D3 VRAM floor.
D5	Snapshot / failure	GPU snapshot ON, no CPU fallback	`enable_memory_snapshot=True` + `experimental_options={"enable_gpu_snapshot": True}`. On snapshot/GPU failure: terminate and respawn a fresh container; no degraded CPU path.
D6	Warm capacity	`min_containers=0`, `buffer_containers=1`	Pure overflow = $0 idle; buffer absorbs the first spike.
D7	Inbound auth	`requires_proxy_auth=True`; tokens held by the control-plane server	The Modal URL is public, but only the CPU control server calls it, so it carries the `Modal-Key`/`Modal-Secret`. Unauthorized traffic is rejected at Modal's edge — no GPU spin-up.
D8	Outbound egress proxy	Not needed	Renderer needs no internal access; the control server reaches the renderer.
D9	Load distribution	Use Modal's built-in autoscaling + LB	Individual containers aren't externally addressable, so we send overflow to the single Modal endpoint and let Modal scale/route internally. The control server decides when to spill (in-house saturated) and forwards to the Modal URL.
D10	Cost ceiling	$100k / month	Enforce via Modal workspace spend limit + billing alerts; size `max_containers` so worst-case sustained burst stays within budget.
D11	Region	Default (no pin)	Avoids the 1.5–1.75× premium; maximum availability.
D12	Environments	`dev` + `prod`, deploy via CI	Pinned image + weights per env.

✅ All decisions locked — ready to implement.

11. References

SGLang + Modal Snapshots example — https://modal.com/docs/examples/sglang_snapshot
Memory Snapshots guide — https://modal.com/docs/guide/memory-snapshot
GPU mem snapshots (blog) — https://modal.com/blog/gpu-mem-snapshots
Proxy Auth Tokens — https://modal.com/docs/guide/webhook-proxy-auth
Modal Proxies (static egress IP) — https://modal.com/docs/guide/proxy-ips
Autoscaling — https://modal.com/docs/guide/scale
High-performance LLM inference — https://modal.com/docs/guide/high-performance-llm-inference

12. Prototype artifacts (this workspace)

In ~/modal_examples/ (workspace bosonai):

vllm_server.py — OpenAI-compatible server pattern (vLLM; SGLang version TBD)
volume_speed.py — the Volume load-speed benchmark (0.92 GB/s result above)
vllm_infer.py — offline batch inference reference
Volume hf-cache — holds test weights

13. Implementation runbook (step by step)

Prereqs: modal CLI authenticated (workspace bosonai). The §10 decisions are locked, so use those values; the only open placeholders are the ones marked in the §14 skeleton.

Step 0 — Secrets

modal secret create hf-token            HF_TOKEN=<token>           # only if a gated/private HF model is used; per D1 the weights come from our private store, whose creds go in their own Secret
modal secret create sglang-api-key      SGLANG_API_KEY=<random>    # app-level key (defense in depth)
# Proxy-auth tokens are created in the Modal dashboard (Settings → Proxy Auth Tokens),
# then handed to the CPU control-plane server as Modal-Key / Modal-Secret (D7).

Step 1 — Provision weights into the Volume (one-time job)

A throwaway Modal Function that pulls the closed-source weights from our private store (creds via a Secret) into the boson-weights Volume, then volume.commit().
Pin the exact version so Modal serves the same weights as in-house (D1).
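
One way to make "pinned" checkable is to verify the provisioned files against a manifest of SHA-256 digests before committing the Volume. This is a sketch, assuming a manifest is published alongside the weights in our private store; the manifest format and function names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB shards never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or whose digest differs."""
    return [rel for rel, expected in manifest.items()
            if not (root / rel).is_file() or sha256_of(root / rel) != expected]


# Self-contained demo with a throwaway file standing in for a weight shard.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard-0.bin"
    shard.write_bytes(b"demo weights")
    good = {"shard-0.bin": hashlib.sha256(b"demo weights").hexdigest()}
    ok_result = verify_manifest(Path(tmp), good)
    bad_result = verify_manifest(Path(tmp), {"shard-0.bin": "0" * 64, "missing.bin": "0" * 64})

print(ok_result, bad_result)  # [] ['shard-0.bin', 'missing.bin']
```

The provisioning job would call the commit step only when the returned list is empty, so a partial or mismatched copy never becomes the served revision.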

Step 2 — Pre-compile kernels into a second Volume

At image build time run sglang.compile_deep_gemm → write artifacts to a kernels Volume.
Avoids JIT during snapshot creation.

Step 3 — Build the serving image (SGLang + CUDA only, NO weights — see §14 skeleton).

Step 4 — First deploy + warm the snapshot

modal deploy sglang_server.py

Trigger ~5 cold starts (send a request, let the app scale to zero, repeat) so the snapshot can "bake"; it warms up over the first few cold starts, not the first few requests (§5).

Step 5 — Validate cold start

Let it scale to zero (wait > scaledown_window), then send one request and measure TTFT.
Target: a few seconds restored (vs tens-of-seconds–minutes unsnapshotted).

Step 6 — Wire the gateway router

Add the Modal endpoint as an OpenAI-compatible backend and send the proxy-auth headers (Modal-Key / Modal-Secret) on every request.
Implement spillover (D9): saturation signal + hysteresis + drain-back to in-house.
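
The saturation signal, hysteresis, and drain-back can be sketched as a two-threshold switch. The thresholds and the use of a single utilization number as the saturation signal are assumptions for illustration; the class and method names are hypothetical.

```python
class SpilloverRouter:
    """Route new requests to Modal while in-house is saturated, with hysteresis."""

    def __init__(self, spill_above: float = 0.90, drain_below: float = 0.70) -> None:
        if not 0.0 <= drain_below < spill_above <= 1.0:
            raise ValueError("need 0 <= drain_below < spill_above <= 1")
        self.spill_above = spill_above
        self.drain_below = drain_below
        self.spilling = False

    def choose_backend(self, inhouse_utilization: float) -> str:
        # Start spilling only at the high mark, stop only at the low mark,
        # so utilization hovering between the two does not flap the route.
        if not self.spilling and inhouse_utilization >= self.spill_above:
            self.spilling = True
        elif self.spilling and inhouse_utilization <= self.drain_below:
            self.spilling = False
        return "modal" if self.spilling else "in-house"


router = SpilloverRouter()
decisions = [router.choose_backend(u) for u in (0.50, 0.92, 0.80, 0.65, 0.80)]
print(decisions)  # ['in-house', 'modal', 'modal', 'in-house', 'in-house']
```

Note that 0.80 yields different answers depending on history: it stays on Modal while draining and stays in-house once drained. That gap between the two thresholds is what prevents repeated cold starts from route flapping.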

Step 7 — Guardrails & observability

Set max_containers (D10). Export logs/metrics. Track spillover rate, cold-start p99, $/burst.
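
The tracked metrics are simple to compute once the raw samples are exported. The sketch below uses the nearest-rank percentile definition and synthetic cold-start samples that are illustrative only, not measurements; function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def spillover_rate(modal_requests: int, total_requests: int) -> float:
    """Fraction of requests served by Modal; 0.0 when there was no traffic."""
    return modal_requests / total_requests if total_requests else 0.0


# Synthetic samples (seconds): nine restored starts and one unsnapshotted outlier.
cold_starts_s = [3.1, 2.8, 4.0, 3.3, 45.2, 3.0, 2.9, 3.5, 3.2, 3.4]
p50 = percentile(cold_starts_s, 50)
p99 = percentile(cold_starts_s, 99)
rate = spillover_rate(modal_requests=150, total_requests=1000)

print(p50, p99, rate)  # 3.2 45.2 0.15
```

The gap between p50 and p99 is the signal to watch: a single unsnapshotted start dominates the tail even when the median looks healthy.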

Step 8 — Promote dev → prod (separate Modal environments, deploy via CI).

14. Reference skeleton: `sglang_server.py` (to be filled)

import os, subprocess
import modal

MODEL_PATH = "/weights/boson-avatar"     # D1 — closed-source, provisioned into Volume (no HF id)
PORT = 30000                              # SGLang default

image = (
    modal.Image.from_registry("nvidia/cuda:12.x-devel-ubuntu22.04", add_python="3.12")
    .pip_install("sglang[all]==<TODO>", "huggingface_hub[hf_transfer]")
    .env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "TORCHINDUCTOR_COMPILE_THREADS": "1",   # required for snapshot + torch.compile
        "SGLANG_ENABLE_JIT_DEEPGEMM": "1",
    })
    # .run_function(precompile_deepgemm, volumes={...})   # Step 2
)

weights  = modal.Volume.from_name("boson-weights", create_if_missing=True)  # D1 private weights
kernels  = modal.Volume.from_name("sglang-kernels", create_if_missing=True)
app = modal.App("boson-sglang-overflow")

@app.cls(
    image=image,
    gpu=["H100", "H200", "L40S"],                     # D3 — any single FP8-capable GPU w/ enough VRAM
    volumes={"/weights": weights, "/kernels": kernels},
    secrets=[modal.Secret.from_name("sglang-api-key")],  # D1 weights live in the Volume, no model secret
    enable_memory_snapshot=True,                       # D5 (GPU snapshot; no CPU fallback — respawn on fail)
    experimental_options={"enable_gpu_snapshot": True},
    min_containers=0, buffer_containers=1,             # D6
    max_containers=int(os.getenv("MAX_CONTAINERS", "...")),  # D10 — sized to the $100k/mo budget
    scaledown_window=300,
    timeout=20 * 60,
)
@modal.concurrent(max_inputs=32)                       # tune to KV headroom
class SGLangServer:
    @modal.enter(snap=True)
    def start_and_warm(self):
        # launch sglang server -> wait ready -> warm requests -> /release_memory_occupation
        cmd = [
            "python", "-m", "sglang.launch_server",
            "--model-path", MODEL_PATH,
            "--host", "0.0.0.0", "--port", str(PORT),
            "--api-key", os.environ["SGLANG_API_KEY"],
            "--quantization", "fp8",          # D2
            "--context-length", "16384",      # D4 — 16k
        ]
        self.proc = subprocess.Popen(cmd)
        # _wait_until_healthy(); _warmup(); _post("/release_memory_occupation")

    @modal.enter(snap=False)
    def resume(self):
        ...  # _post("/resume_memory_occupation")

    @modal.web_server(port=PORT, requires_proxy_auth=True, startup_timeout=10 * 60)  # D7
    def serve(self):
        ...  # SGLang already serves the OpenAI-compatible API on PORT

This is a structural skeleton, not a tested file. The exact SGLang flags, the warm-up / release / resume calls, and the web_server vs asgi_app wiring get finalized against the real model and the pinned SGLang version at bring-up. The <TODO> and "..." placeholders must be filled before deploy.
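
The helpers the skeleton refers to in comments (_wait_until_healthy, _post) could look roughly like this. It is a sketch using only the standard library; the /health path and the bearer-token header are assumptions to confirm against the pinned SGLang version, and the function names are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Optional


def wait_until_healthy(base_url: str, path: str = "/health",
                       timeout_s: float = 600.0, interval_s: float = 1.0) -> bool:
    """Poll the server until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not listening yet; retry after the interval
        time.sleep(interval_s)
    return False


def post_json(base_url: str, path: str, payload: Optional[dict] = None,
              api_key: Optional[str] = None) -> int:
    """POST a JSON body (e.g. to /release_memory_occupation); return the status."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload or {}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    if api_key:
        req.add_header("Authorization", f"Bearer {api_key}")  # assumed auth scheme
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status


# A zero timeout returns immediately without touching the network.
not_ready = wait_until_healthy("http://127.0.0.1:30000", timeout_s=0.0)
print(not_ready)  # False
```

Inside the snap=True hook the order would be: launch the process, wait_until_healthy, send warm-up requests, then post_json to /release_memory_occupation; the snap=False hook would post to /resume_memory_occupation.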

15. Async render flow (maps to Boson video API)

A pure-CPU control-plane server sits between the gateway and the GPU renderers. Deployed in the gateway's internal network (so it can reach internal Redis and public S3), it is the control plane: it standardizes the Boson video API, tracks job status in Redis, dispatches renders, and uploads the finished MP4 to S3. The GPU renderer — in-house or Modal — is pure compute: it receives a render request and returns the MP4 bytes. It never touches Redis or S3.

API (unchanged): POST /v1/videos → {id, status:"queued", progress, ...}; GET /v1/videos/{id} (status); GET /v1/videos/{id}/content (MP4). Status ∈ queued|in_progress|completed|failed.

Tier	Network	Responsibilities
Gateway	edge	ingress, auth, route to control plane
CPU control-plane server	internal (Redis + public S3)	API standardization, job status in Redis, render dispatch + spillover, MP4 upload to S3
GPU renderer (in-house / Modal)	compute only	render frames → return MP4 bytes; no Redis, no S3, no state

Figure: §15 async render flow (the CPU control plane owns Redis + S3; the Modal renderer only returns MP4 bytes).

Why this is clean — the renderer touches nothing internal. Redis and S3 are owned entirely by the CPU control plane, which already lives in the internal network. The Modal renderer only needs its inbound model weights (read from the Modal Volume, provisioned once from our private store) and returns MP4 bytes over its proxy-auth'd endpoint. So:

No HTTPS callback, no modal.Proxy egress IP, no VPC — the renderer never reaches Redis or S3.
Fully stateless: no container affinity, no "which container has the MP4", no sticky sessions (§4.4).

Status reporting — decision: the CPU control plane writes Redis directly (it's already inside the internal network). The render call is just request→MP4-bytes. Superseded: the earlier "HTTPS callback from the renderer" idea is no longer needed, because the renderer no longer owns S3 or status.

Progress granularity: coarse status (queued → in_progress → completed) comes free from the dispatch lifecycle. For live progress 0–100, have the renderer stream progress events back to the control plane during the render; otherwise the control plane reports coarse status.
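
The coarse status lifecycle can be sketched with in-memory stand-ins: a dict plays the role of Redis and another plays S3. This is illustrative only; the class, the ID format, and the fake renderer are hypothetical, and a real control plane would run the job asynchronously.

```python
import uuid
from typing import Callable, Dict


class ControlPlane:
    """Owns all state; the renderer is a plain request -> MP4-bytes callable."""

    def __init__(self, render: Callable[[dict], bytes]) -> None:
        self.render = render
        self.status: Dict[str, dict] = {}   # stand-in for Redis
        self.bucket: Dict[str, bytes] = {}  # stand-in for S3

    def create_video(self) -> dict:
        """POST /v1/videos: register the job and return its queued record."""
        video_id = f"video_{uuid.uuid4().hex}"
        self.status[video_id] = {"id": video_id, "status": "queued", "progress": 0}
        return dict(self.status[video_id])

    def run_job(self, video_id: str, request: dict) -> dict:
        """Dispatch the render, upload the result, and record the outcome."""
        job = self.status[video_id]
        job["status"] = "in_progress"
        try:
            mp4 = self.render(request)             # in-house or Modal
            self.bucket[f"{video_id}.mp4"] = mp4   # upload keyed by video_id
            job.update(status="completed", progress=100)
        except Exception:
            job["status"] = "failed"
        return dict(job)


def fake_renderer(request: dict) -> bytes:
    return b"placeholder-mp4-bytes"   # a real renderer returns encoded video


plane = ControlPlane(fake_renderer)
created = plane.create_video()
finished = plane.run_job(created["id"], {"prompt": "demo"})
print(created["status"], finished["status"])  # queued completed
```

Because the renderer is just a callable that returns bytes, swapping in-house for Modal changes only which function is passed in, not where status or storage live.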

Work items:

CPU control-plane server: Boson API standardization + Redis status writer + S3 uploader
Render dispatch with spillover (in-house default, Modal on saturation)
Modal render endpoint returns MP4 bytes (proxy-auth protected); optional progress stream
