Infrastructure Design · Boson AI

Modal Burst-Overflow Scalability Plan

Full technical specification — architecture, cold-start strategy, benchmarks, decisions & runbook.

Owner: erik@boson.ai · Status: Implementation-ready · Scope: inference / rendering

Owner: erik@boson.ai · Last updated: 2026-06-16 · Status: Implementation-ready (all §10 decisions locked)


Provenance of figures. Numbers are tagged by source so none are taken on faith:

• [M] measured by us on Modal this session: 0.92 GB/s volume read, 0→12→0 autoscale, 15-req→9-container fan-out, region pin us-east → us-east-2.
• [T] team-provided: 75 GB image, ~40 GB VRAM, FP8, 16k context, $100k/mo budget.
• [D] Modal published docs: "<1s" start, 3–10× snapshot speedup, 1.5×/1.75× region premium, 1–2 GB/s volume, ≤5 proxy IPs.
• [E] estimate (illustrative, not measured; flagged inline): the §2 K8s-vs-Modal cold-start ranges and the SGLang init time. Directional only.

1. Goal

Use Modal as a burst-overflow inference tier, not the primary serving layer.

Non-goals: running 24/7 max-throughput on Modal (in-house is cheaper for constant load); training (this doc is inference-serving only).


2. What Modal actually is (architecture)

Modal is not Kubernetes. It is a custom serverless GPU platform:

| Layer | Implementation |
| --- | --- |
| Orchestration | Custom multi-cloud scheduler in Rust (deliberately not Docker/K8s) |
| Container runtime | Custom Rust runtime, sub-second container start |
| Isolation | gVisor (multi-tenant sandbox; first to run gVisor w/ GPUs) |
| Filesystem | ImageFS: lazy-loading, content-addressed, FUSE. Images stream on demand |
| Hardware | Bare-metal GPU instances pooled across clouds (e.g. Oracle OCI bare metal) |

Implications for us:

Why container start is "<1s" (and what it excludes). The <1s covers only the runtime bringing up the container process on an already-warm bare-metal machine. Modal gets there by eliminating the three steps that make a "normal" cold start slow:

| Normal slow step | Typical time [E] | Modal's trick |
| --- | --- | --- |
| Provision a VM / add a K8s node | 30 s – 5 min | Warm bare-metal pool, GPUs pre-attached → ~0 |
| docker pull (extract all layers first) | 30 s – minutes (multi-GB GPU images) | ImageFS lazy mount: start now, stream files on demand |
| Container runtime setup | 0.1 – 1 s | Rust runtime + gVisor, optimized hot path |

The time ranges above and the 2–10 min K8s figure below are [E] typical industry estimates, illustrative only — not measured by us, not published by Modal. They convey the direction (Modal removes the node-provision and image-pull waits), not benchmarked values.

The "<1s" Modal start [D] excludes loading our model weights. A GPU pod scaling from zero on K8s is typically 2–10 min [E] end-to-end; Modal collapses that to seconds — but our service cold start is still gated by weight load + SGLang init (see §5).


3. Our workload facts

| Fact | Value |
| --- | --- |
| Image size (weights + SGLang + CUDA) | ~75 GB |
| Model | Closed-source, FP8 (no HF id; weights from our private store) |
| Runtime VRAM | ~40 GB (FP8 model + 16k KV cache) → fits on one GPU |
| Serving engine | SGLang |
| GPU | Any single FP8-capable GPU with enough VRAM (H100 / H200 / L40S …); take whatever Modal has (D3) |
| GPU count | 1 (keeps us on the supported single-GPU snapshot path) |

4. Target architecture

[Figure: Clients (requests) → Gateway (edge, auth) → CPU control-plane server (internal net, Redis + S3; API, status, S3 upload, render dispatch + spillover; Redis is internal, S3 is public) → In-house renderer (baseline, always-on) while it has capacity, or Modal renderer (GPU, stateless, burst, scales to 0) once in-house is saturated (spill).]
§4 — Target architecture. A pure-CPU control-plane server (co-located with the gateway, in the internal network) owns the API, Redis status, S3 upload, and render dispatch. The GPU renderer is interchangeable, stateless compute behind it.

A pure-CPU control-plane server (co-located with the gateway, inside the internal network with access to internal Redis and public S3) owns API standardization, job status, the MP4 upload, and the spillover decision. The GPU renderer — in-house or Modal — is interchangeable, stateless compute behind it.

4.1 The endpoint: one URL, load-balanced across replicas (verified)

A Modal web endpoint is one stable URL, not a per-machine address:

[Figure: Gateway (one URL) → Modal proxy (*.modal.run, stateless LB, 0..N containers) → container 1 (AWS · us-east-1), container 2 (Azure · eastus2), container 3 (GCP · europe-west1), … container N; each container renders requests.]
§4.1 — One stable URL, load-balanced across replicas. Verified: 15 requests → 9 containers across AWS + Azure + GCP. Individual containers are not addressable.

4.2 Where load-balancing behavior is defined

Modal owns the routing; you own the knobs — there is no custom routing/LB config to write.

| Concern | Who controls it | How |
| --- | --- | --- |
| Request → container routing | Modal proxy (automatic) | Routes to any container with spare capacity. No rules to define. |
| How many requests per container before fanning out | You | @modal.concurrent(target_inputs=…, max_inputs=…): target_inputs = autoscaler goal, max_inputs = hard ceiling (containers burst to this if scale-up lags) |
| How many containers exist | You | min_containers, buffer_containers, max_containers, scaledown_window |
| Session affinity / sticky routing | Not available in Modal | Routing is stateless. If we ever need affinity or custom LB, it goes in our gateway in front of Modal, or via separate named apps. |
| Region of placement | You (optional) | region= (adds 1.5–1.75× cost; default = unpinned multi-cloud pool) |

For our stateless OpenAI-style inference, the default stateless LB is exactly right — we only tune @modal.concurrent (throughput/$ per replica) and the autoscaler bounds.

4.3 Scale up/down (measured 2026-06-29)

Tested with min_containers=0, scaledown_window=20, @modal.concurrent(max_inputs=1), driving 12 concurrent requests at the single URL:

| Phase | Containers |
| --- | --- |
| At rest | 0 (scaled to zero, $0) |
| Under load (12 concurrent) | 12 (one per request, max_inputs=1) |
| Idle, within scaledown_window | 12 (held ~20–24s after last request) |
| Idle +32s → +40s | 12 → 1 → 0 |

Confirms the overflow economics: replicas exist only during the burst + a short cooldown, then drain to zero. scaledown_window tunes how long warm replicas linger (latency vs cost).
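The economics above reduce to simple arithmetic: each replica is billed for the burst plus its idle cooldown. The sketch below is a rough approximation under stated assumptions. The function name and the 60 s burst length are hypothetical, and the cooldown is passed in explicitly because the measured drain (complete at about +40 s) ran longer than the configured scaledown_window of 20 s.

```python
def billed_replica_seconds(replicas: int, burst_s: float, cooldown_s: float) -> float:
    """Approximate replica-seconds for one burst: every replica runs for the
    burst itself plus the idle cooldown before it is drained."""
    if replicas < 0 or burst_s < 0 or cooldown_s < 0:
        raise ValueError("inputs must be non-negative")
    return replicas * (burst_s + cooldown_s)


# 12 replicas as in the measured run; the 60 s burst length is a made-up example.
configured = billed_replica_seconds(replicas=12, burst_s=60, cooldown_s=20)
observed = billed_replica_seconds(replicas=12, burst_s=60, cooldown_s=40)
print(configured, observed)
```

Comparing the two values gives a feel for how much the cooldown, rather than the burst, contributes to the bill for short bursts.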

4.4 Governing principle: control plane vs render plane

Modal is a stateless render plane (compute only). The upper layer — a pure-CPU control-plane server co-located with the gateway — owns all state. Modal scales the renderer; status, lifecycle, Redis, and S3 stay above it.

| Concern | Owner |
| --- | --- |
| API standardization (POST /v1/videos, GET /{id}, GET /{id}/content) | CPU control-plane server |
| Job lifecycle, video_id, status/progress | CPU control-plane server → Redis (internal) |
| Render dispatch + spillover (in-house vs Modal) | CPU control-plane server |
| MP4 upload to S3 (+ presigned reads) | CPU control-plane server (public S3 access) |
| GPU rendering (frames → MP4 bytes) | Renderer (in-house / Modal): pure compute, stateless |

Because the renderer holds no state, the hard distributed-systems questions disappear: "which container has the mp4?" (none — the CPU control plane uploads it to S3 by video_id), sticky sessions (N/A — fire-and-forget), keeping replicas warm / region pinning (nothing to keep). The renderer touches nothing internal — it just returns the MP4 bytes to the CPU control plane, which (already inside the internal network) owns Redis and S3. So there is no HTTPS callback, no egress proxy, and no VPC on the Modal side.
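Because the renderer is stateless, the spillover decision in the control plane can be very small. The following is a minimal sketch, not the planned implementation: the class name, the `in_house_capacity` field, and the idea of counting in-flight renders as the saturation signal are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpilloverRouter:
    """Chooses a render target: in-house while it has capacity, Modal once saturated."""
    in_house_capacity: int   # hypothetical: max concurrent in-house renders
    in_flight: int = 0       # in-house renders currently running

    def acquire(self) -> str:
        """Return "in_house" and reserve a slot, or "modal" when in-house is full."""
        if self.in_flight < self.in_house_capacity:
            self.in_flight += 1
            return "in_house"
        return "modal"

    def release(self, target: str) -> None:
        """Free the in-house slot when an in-house render finishes."""
        if target == "in_house" and self.in_flight > 0:
            self.in_flight -= 1


router = SpilloverRouter(in_house_capacity=2)
targets = [router.acquire() for _ in range(3)]
print(targets)
```

The sketch is single-threaded; a real control plane would need an atomic counter (for example in Redis) if several workers dispatch concurrently.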


5. Cold-start strategy (the core problem)

Anatomy of our cold start (what we're actually fighting)

container start (<1s [D])  +  weight load (~43s [M-derived])  +  SGLang init/compile (10s–min [E])
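As a rough check on the anatomy above, the weight-load term can be reproduced from figures already in this document. The sketch assumes the ~43 s value comes from reading about 40 GB (the §3 runtime footprint, used here as a stand-in for bytes read) at the measured 0.92 GB/s. The function names are illustrative, and the engine-init term stays an [E] range, so it is passed in rather than hard-coded.

```python
def weight_load_seconds(size_gb: float, read_gb_per_s: float) -> float:
    """Time to stream `size_gb` from a Volume at a sustained read rate."""
    if read_gb_per_s <= 0:
        raise ValueError("read rate must be positive")
    return size_gb / read_gb_per_s


def cold_start_seconds(container_start_s: float, size_gb: float,
                       read_gb_per_s: float, engine_init_s: float) -> float:
    """Sum of the three terms: container start + weight load + engine init."""
    return container_start_s + weight_load_seconds(size_gb, read_gb_per_s) + engine_init_s


# Assumption: ~40 GB read at the measured 0.92 GB/s.
load_s = weight_load_seconds(40, 0.92)
print(f"weight load ~ {load_s:.1f} s")
```

Plugging in the low and high ends of the engine-init estimate shows how much of the total is outside the Volume read and therefore only addressable by the snapshot.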

Scale-to-zero means the cold start lands exactly when a burst hits. Three layers fix it:

  1. Split the 75 GB image

    • Image = SGLang + CUDA only (few GB).
    • Weights → Modal Volume (hf-cache), pinned to an exact revision.
    • Pre-compiled kernels (DeepGEMM) → a second Volume, built at image-build time.
  2. Memory snapshot of the warm SGLang server (officially supported; 3–10× faster [D])

    • Single GPU → enable_memory_snapshot=True + experimental_options={"enable_gpu_snapshot": True}.
    • @modal.enter(snap=True): start SGLang, warm it, /release_memory_occupation, then snapshot.
    • @modal.enter(snap=False): /resume_memory_occupation on restore.
    • Env: TORCHINDUCTOR_COMPILE_THREADS=1 (else torch.compile breaks snapshots).
    • Note: snapshot "warms up" over the first ~5 cold starts, not instantly on day one.
  3. Burst headroom

    • min_containers=0 (pay nothing idle) + buffer_containers=1 (first spike request doesn't eat a full cold start).
    • Optional predictive pre-warm: bump min_containers on a schedule before known peaks.
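The optional predictive pre-warm in layer 3 amounts to a lookup from time of day to a min_containers floor. The sketch below shows only that lookup; the peak window and replica count are hypothetical, and how the value is pushed to Modal (for example from a scheduled job) is deliberately left out because it is not specified here.

```python
from datetime import time


def min_containers_for(now: time, peaks: list[tuple[time, time, int]], default: int = 0) -> int:
    """Return the min_containers floor to hold at `now`.

    `peaks` holds (start, end, min_containers) windows with start < end on the
    same day. Outside every window the D6 default of 0 applies.
    """
    floor = default
    for start, end, containers in peaks:
        if start <= now < end:
            floor = max(floor, containers)
    return floor


# Hypothetical schedule: hold 2 warm replicas from shortly before a 09:00 peak.
PEAKS = [(time(8, 45), time(10, 0), 2)]
print(min_containers_for(time(9, 0), PEAKS), min_containers_for(time(12, 0), PEAKS))
```

Starting the window a little before the peak matters because a pre-warmed replica still has to pay its own cold start before it is useful.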

Measured Volume load speed (benchmarked 2026-06-16)


6. Security & networking ("proxy")#

Two distinct concerns:

Inbound: protect the endpoint (REQUIRED)

A public GPU endpoint that cold-starts on request is a cost/DoS risk.

Outbound: static egress IP (ONLY IF NEEDED)


7. Autoscaling knobs (Modal primitives, not K8s)

| Knob | Setting | Purpose |
| --- | --- | --- |
| min_containers | 0 (or small for hot peaks) | Pay nothing idle |
| buffer_containers | 1–2 | Absorb first spike without cold start |
| max_containers | set a ceiling | Cost cap / blast-radius limit |
| @modal.concurrent(max_inputs=N) | tuned | Requests packed per replica (SGLang batches → $/token lever) |
| scaledown_window | e.g. 300 s | How long idle replicas linger before scaling to 0 |
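How the concurrency and ceiling knobs interact can be approximated with one line of arithmetic. This is a simplified model, not Modal's autoscaler: it ignores buffer_containers and target_inputs, and the max_containers value of 50 is a placeholder.

```python
import math


def containers_needed(concurrent_requests: int, max_inputs: int, max_containers: int) -> int:
    """Replicas required when each replica accepts at most `max_inputs`
    requests at once, capped at the `max_containers` ceiling."""
    if concurrent_requests < 0 or max_inputs < 1 or max_containers < 0:
        raise ValueError("invalid knob values")
    return min(math.ceil(concurrent_requests / max_inputs), max_containers)


# max_inputs=1 reproduces the §4.3 run (12 concurrent requests, 12 containers).
print(containers_needed(12, max_inputs=1, max_containers=50))
# Packing 32 requests per replica, as in the §14 skeleton, needs far fewer replicas.
print(containers_needed(12, max_inputs=32, max_containers=50))
```

The gap between the two results is the $/token lever from the table: the more requests SGLang can batch per replica, the fewer GPUs a given burst costs.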

8. Cost model


9. Go-live checklist

Must-have

Nice / depends


10. Decisions (locked)

All decisions resolved with the team — values below are final.

| # | Decision | Value | Notes |
| --- | --- | --- | --- |
| D1 | Model source | Closed-source (no HF id) | Private weights provisioned once into a Modal Volume from our own store (pinned). Store creds are a Secret used only by the provisioning job; the renderer reads weights from the Volume. |
| D2 | Quantization | FP8 (done) | Model already FP8. Constrains GPU choice to FP8-capable hardware (D3). |
| D3 | GPU | Any single FP8-capable GPU with enough VRAM; no fixed type | Take whatever Modal has via a preference list (e.g. gpu=["H100","H200","L40S"]), ordered by availability/cost; exclude non-FP8 parts (e.g. A100). Still 1 GPU (keeps single-GPU snapshot path). VRAM floor = FP8 model + 16k KV cache (≈ the ~40 GB runtime + headroom). Caveat: GPU snapshots are hardware-specific → expect one snapshot baked per GPU type in the pool (validate; narrow the list if it hurts). |
| D4 | Context length | 16k (--context-length 16384) | Size KV cache for 16k; included in the D3 VRAM floor. |
| D5 | Snapshot / failure | GPU snapshot ON, no CPU fallback | enable_gpu_snapshot=True. On snapshot/GPU failure: terminate and respawn a fresh container; no degraded CPU path. |
| D6 | Warm capacity | min_containers=0, buffer_containers=1 | Pure overflow = $0 idle; buffer absorbs the first spike. |
| D7 | Inbound auth | requires_proxy_auth=True; tokens held by the control-plane server | The Modal URL is public, but only the CPU control server calls it, so it carries the Modal-Key/Modal-Secret. Unauthorized traffic is rejected at Modal's edge, so there is no GPU spin-up. |
| D8 | Outbound egress proxy | Not needed | Renderer needs no internal access; the control server reaches the renderer. |
| D9 | Load distribution | Use Modal's built-in autoscaling + LB | Individual pods aren't externally addressable, so we send overflow to the single Modal endpoint and let Modal scale/route internally. The control server decides when to spill (in-house saturated) and forwards to the Modal URL. |
| D10 | Cost ceiling | $100k / month | Enforce via Modal workspace spend limit + billing alerts; size max_containers so worst-case sustained burst stays within budget. |
| D11 | Region | Default (no pin) | Avoids the 1.5–1.75× premium; maximum availability. |
| D12 | Environments | dev + prod, deploy via CI | Pinned image + weights per env. |

✅ All decisions locked — ready to implement.
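D10 asks for max_containers to be sized so that a worst-case sustained burst stays within the $100k budget. A minimal sketch of that sizing follows. The per-hour GPU price is a placeholder and not a Modal quote, the function name is hypothetical, and the model treats every replica as running all month, which is the pessimistic case.

```python
import math

HOURS_PER_MONTH = 730  # average month (8760 h / 12)


def max_containers_for_budget(monthly_budget_usd: float, gpu_usd_per_hour: float,
                              duty_cycle: float = 1.0) -> int:
    """Largest replica ceiling whose worst-case spend stays within budget.

    duty_cycle=1.0 models every replica running for the whole month.
    """
    if gpu_usd_per_hour <= 0 or not 0 < duty_cycle <= 1:
        raise ValueError("price must be positive and duty_cycle in (0, 1]")
    per_replica_usd = gpu_usd_per_hour * HOURS_PER_MONTH * duty_cycle
    return math.floor(monthly_budget_usd / per_replica_usd)


# 4.00 USD/h is a placeholder price, used only to exercise the arithmetic.
print(max_containers_for_budget(100_000, gpu_usd_per_hour=4.00))
```

Because the D3 preference list mixes GPU types with different prices, the conservative choice is to size against the most expensive type in the list.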


11. References


12. Prototype artifacts (this workspace)

In ~/modal_examples/ (workspace bosonai):


13. Implementation runbook (step by step)

Prereqs: modal CLI authed (workspace bosonai). The §10 decisions are locked; fill the remaining <TODO> placeholders in the §14 skeleton first.

Step 0 — Secrets

modal secret create hf-token            HF_TOKEN=<token>           # if gated/private model
modal secret create sglang-api-key      SGLANG_API_KEY=<random>    # app-level key (defense in depth)
# Proxy-auth tokens are created in the Modal dashboard (Settings → Proxy Auth Tokens),
# then handed to the CPU control-plane server (D7), which sends them as Modal-Key / Modal-Secret.
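On the calling side, the proxy-auth token pair travels as two request headers. The sketch below builds such a request with the standard library only and does not send it. The URL, the payload shape, and the MODAL_PROXY_KEY / MODAL_PROXY_SECRET environment variable names are hypothetical placeholders.

```python
import json
import os
import urllib.request


def build_render_request(url: str, payload: dict, token_id: str,
                         token_secret: str) -> urllib.request.Request:
    """Build a POST that carries the proxy-auth token pair as headers."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Modal-Key": token_id,
            "Modal-Secret": token_secret,
        },
    )


req = build_render_request(
    "https://example.invalid/render",                       # placeholder URL
    {"video_id": "video_123"},                              # hypothetical payload shape
    os.environ.get("MODAL_PROXY_KEY", "wk-placeholder"),     # hypothetical env var name
    os.environ.get("MODAL_PROXY_SECRET", "ws-placeholder"),  # hypothetical env var name
)
# urllib.request.urlopen(req, timeout=...) would send it; omitted so the sketch runs offline.
print(req.get_method(), req.full_url)
```

Keeping the token pair in the caller's secret store, rather than in code, matches the intent of D7 that only one component is able to reach the Modal URL.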

Step 1 — Provision weights into the Volume (one-time job)

Step 2 — Pre-compile kernels into a second Volume

Step 3 — Build the serving image (SGLang + CUDA only, NO weights — see §14 skeleton).

Step 4 — First deploy + warm the snapshot

modal deploy sglang_server.py

Step 5 — Validate cold start

Step 6 — Wire the gateway router

Step 7 — Guardrails & observability

Step 8 — Promote dev → prod (separate Modal environments, deploy via CI).


14. Reference skeleton: sglang_server.py (to be filled)

import os, subprocess
import modal

MODEL_PATH = "/weights/boson-avatar"     # D1 — closed-source, provisioned into Volume (no HF id)
PORT = 30000                              # SGLang default

image = (
    modal.Image.from_registry("nvidia/cuda:12.x-devel-ubuntu22.04", add_python="3.12")
    .pip_install("sglang[all]==<TODO>", "huggingface_hub[hf_transfer]")
    .env({
        "HF_HUB_ENABLE_HF_TRANSFER": "1",
        "TORCHINDUCTOR_COMPILE_THREADS": "1",   # required for snapshot + torch.compile
        "SGLANG_ENABLE_JIT_DEEPGEMM": "1",
    })
    # .run_function(precompile_deepgemm, volumes={...})   # Step 2
)

weights  = modal.Volume.from_name("boson-weights", create_if_missing=True)  # D1 private weights
kernels  = modal.Volume.from_name("sglang-kernels", create_if_missing=True)
app = modal.App("boson-sglang-overflow")

@app.cls(
    image=image,
    gpu=["H100", "H200", "L40S"],                     # D3 — any single FP8-capable GPU w/ enough VRAM
    volumes={"/weights": weights, "/kernels": kernels},
    secrets=[modal.Secret.from_name("sglang-api-key")],  # D1 weights live in the Volume, no model secret
    enable_memory_snapshot=True,                       # D5 (GPU snapshot; no CPU fallback — respawn on fail)
    experimental_options={"enable_gpu_snapshot": True},
    min_containers=0, buffer_containers=1,             # D6
    max_containers=int(os.getenv("MAX_CONTAINERS", "...")),  # D10 — sized to the $100k/mo budget
    scaledown_window=300,
    timeout=20 * 60,
)
@modal.concurrent(max_inputs=32)                       # tune to KV headroom
class SGLangServer:
    @modal.enter(snap=True)
    def start_and_warm(self):
        # launch sglang server -> wait ready -> warm requests -> /release_memory_occupation
        cmd = [
            "python", "-m", "sglang.launch_server",
            "--model-path", MODEL_PATH,
            "--host", "0.0.0.0", "--port", str(PORT),
            "--api-key", os.environ["SGLANG_API_KEY"],
            "--quantization", "fp8",          # D2
            "--context-length", "16384",      # D4 — 16k
        ]
        self.proc = subprocess.Popen(cmd)
        # _wait_until_healthy(); _warmup(); _post("/release_memory_occupation")

    @modal.enter(snap=False)
    def resume(self):
        ...  # _post("/resume_memory_occupation")

    @modal.web_server(port=PORT, requires_proxy_auth=True, startup_timeout=10 * 60)  # D7
    def serve(self):
        ...  # SGLang already serves the OpenAI-compatible API on PORT

This is a structural skeleton, not a tested file. The exact SGLang flags, the warm-up / release / resume calls, and the web_server vs asgi_app wiring still have to be finalized against the real model and the pinned SGLang version; D1, D2, and D4 themselves are locked (§10).
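The commented-out helpers in the skeleton (_wait_until_healthy, _post) could look roughly like the sketch below, using only the standard library. It rests on assumptions that need checking against the pinned SGLang version: that the server exposes GET /health, that the release and resume endpoints accept an empty JSON POST, and that the API key is sent as a Bearer token. The function names and timeouts are illustrative.

```python
import time
import urllib.request
from typing import Optional


def call(base_url: str, path: str, api_key: str, method: str = "GET",
         body: Optional[bytes] = None, request_timeout_s: float = 30.0) -> int:
    """Send one request to the local server and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}{path}", data=body, method=method,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=request_timeout_s) as resp:
        return resp.status


def wait_until_healthy(base_url: str, api_key: str, timeout_s: float = 600.0,
                       interval_s: float = 1.0, request_timeout_s: float = 5.0) -> None:
    """Poll /health until it answers 200, or raise TimeoutError at the deadline."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if call(base_url, "/health", api_key, request_timeout_s=request_timeout_s) == 200:
                return
        except OSError:
            pass  # not listening yet, or a non-2xx answer; retry until the deadline
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{base_url} not healthy after {timeout_s}s")
        time.sleep(interval_s)


def release_memory(base_url: str, api_key: str) -> int:
    """Ask the server to release GPU memory before the snapshot is taken."""
    return call(base_url, "/release_memory_occupation", api_key, method="POST", body=b"{}")


def resume_memory(base_url: str, api_key: str) -> int:
    """Ask the server to re-occupy GPU memory after a snapshot restore."""
    return call(base_url, "/resume_memory_occupation", api_key, method="POST", body=b"{}")
```

In the skeleton, start_and_warm would call wait_until_healthy and release_memory against http://127.0.0.1:30000 after launching the process, and resume would call resume_memory.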


15. Async render flow (maps to Boson video API)

A pure-CPU control-plane server sits between the gateway and the GPU renderers. Deployed in the gateway's internal network (so it can reach internal Redis and public S3), it is the control plane: it standardizes the Boson video API, tracks job status in Redis, dispatches renders, and uploads the finished MP4 to S3. The GPU renderer — in-house or Modal — is pure compute: it receives a render request and returns the MP4 bytes. It never touches Redis or S3.

API (unchanged): POST /v1/videos → {id, status:"queued", progress, ...}; GET /v1/videos/{id} (status); GET /v1/videos/{id}/content (MP4). Status ∈ queued|in_progress|completed|failed.

| Tier | Network | Responsibilities |
| --- | --- | --- |
| Gateway | edge | ingress, auth, route to control plane |
| CPU control-plane server | internal (Redis + public S3) | API standardization, job status in Redis, render dispatch + spillover, MP4 upload to S3 |
| GPU renderer (in-house / Modal) | compute only | render frames → return MP4 bytes; no Redis, no S3, no state |
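[Figure: sequence diagram with three lanes: Client, CPU control plane (internal; Redis + S3), Modal renderer. Client sends POST /v1/videos → control plane creates video_id, sets Redis = queued, dispatches the render, and returns 200 {id, status: queued} → render request (spill on burst) → renderer renders frames and returns MP4 bytes (stateless; no Redis/S3) → control plane uploads to S3 under video_id and sets Redis = completed → client GET /{id} reads status from Redis; GET /{id}/content returns a presigned S3 URL.]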
§15 — Async render flow. The CPU control plane owns Redis + S3; the Modal renderer only returns MP4 bytes. The renderer never reaches internal state — no callback, no VPC.

Why this is clean — the renderer touches nothing internal. Redis and S3 are owned entirely by the CPU control plane, which already lives in the internal network. The Modal renderer only needs its inbound model weights (read from the Modal Volume, provisioned once from our private store) and returns MP4 bytes over its proxy-auth'd endpoint. So:

Status reporting — decision: the CPU control plane writes Redis directly (it's already inside the internal network). The render call is just request→MP4-bytes. Superseded: the earlier "HTTPS callback from the renderer" idea is no longer needed, because the renderer no longer owns S3 or status.
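The status-reporting decision can be illustrated with a small in-memory model of the control plane. This is a sketch under assumptions: plain dicts stand in for Redis and S3, the function names and the S3 key layout are hypothetical, and the renderer is any callable that maps a video_id to MP4 bytes.

```python
import uuid
from typing import Callable

redis_status: dict[str, str] = {}   # stand-in for internal Redis
s3_objects: dict[str, bytes] = {}   # stand-in for the S3 bucket


def create_video() -> dict:
    """POST /v1/videos: mint a video_id and record it as queued."""
    video_id = f"video_{uuid.uuid4().hex}"
    redis_status[video_id] = "queued"
    return {"id": video_id, "status": "queued"}


def run_render(video_id: str, render: Callable[[str], bytes]) -> None:
    """Dispatch one render and record the outcome; the renderer only returns bytes."""
    redis_status[video_id] = "in_progress"
    try:
        mp4 = render(video_id)            # in-house or Modal: request -> MP4 bytes
    except Exception:
        redis_status[video_id] = "failed"
        return
    s3_objects[f"{video_id}.mp4"] = mp4   # the upload happens in the control plane
    redis_status[video_id] = "completed"


job = create_video()
run_render(job["id"], lambda vid: b"\x00fake-mp4")   # fake renderer for the sketch
print(redis_status[job["id"]])
```

Every status write happens on the control-plane side of the call, which is why the renderer needs neither Redis nor S3 credentials.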

Progress granularity: coarse status (queued → in_progress → completed) comes free from the dispatch lifecycle. For live progress 0–100, have the renderer stream progress events back to the control plane during the render; otherwise the control plane reports coarse status.
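If live progress is wanted, the control plane has to turn streamed renderer events into a 0–100 value that never moves backwards. The sketch below assumes a hypothetical event shape of (frames_done, frames_total); the actual wire format between renderer and control plane is not defined in this document.

```python
from typing import Iterable, Iterator


def progress_percent(events: Iterable[tuple[int, int]]) -> Iterator[int]:
    """Turn (frames_done, frames_total) events into a strictly increasing 0-100 percent.

    Duplicate or out-of-order events are dropped so stored progress never goes backwards.
    """
    last = -1
    for done, total in events:
        if total <= 0:
            continue                      # ignore malformed events
        pct = max(0, min(100, (done * 100) // total))
        if pct > last:
            last = pct
            yield pct


print(list(progress_percent([(0, 200), (50, 200), (40, 200), (200, 200)])))
```

Each yielded value would be written to the job's progress field in Redis; events that do not advance the percentage cause no write at all.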

Work items: