Infrastructure Design · Boson AI

Modal Burst-Overflow Scalability Plan

Full technical specification — architecture, cold-start strategy, benchmarks, decisions & runbook.

Owner: erik@boson.ai · Status: Implementation-ready · Scope: inference / rendering

Last updated: 2026-06-29 · Status: Design resolved — renderer = async video job (returns MP4), invoked via spawn(); C1/C2 closed. Remaining gate: real-image validation (no real-workload numbers yet). Ready to prototype.


Provenance of figures. Every number is tagged so none is taken on faith:

  • [M-platform] — Modal platform behaviour we observed this session using a synthetic stand-in container (a cheap debian_slim CPU container with a time.sleep() as fake work). It validates Modal's scheduler/autoscaler, not our real image, model, GPU, or render.
  • [D] — Modal published docs. · [T] — team-provided. · [E] — estimate (illustrative).

⚠️ No real-workload numbers exist yet. Cold-start time, per-GPU render concurrency, throughput, GPU-OOM behaviour, snapshot speedup, and $/render all require deploying the real 75 GB image on a GPU — not yet done. Every [M-platform] figure reflects Modal mechanics with a stand-in, never the renderer.

Tag inventory:

  • [M-platform]: scale 0→12→0, 15-req→9-container fan-out, region pin us-east → us-east-2, 0.92 GB/s Volume read (3 GB file, CPU box).
  • [T]: 75 GB image, ~40 GB VRAM, FP8, 16k context, $100k/mo budget.
  • [D]: "<1s" container start, 3–10× snapshot speedup, 1.5×/1.75× region premium, 1–2 GB/s Volume, ≤5 proxy IPs, 150s web timeout.
  • [E]: §2 K8s-vs-Modal cold-start ranges, SGLang init time, the ~43s weight-load extrapolation.

Design-review resolutions

  • C1 — workload identity → RESOLVED. The Modal renderer is an async video job: it takes a render request and returns the finished MP4. SGLang is the internal token backbone; the renderer's external interface is the video job, not an OpenAI/streaming endpoint.
  • C2 — render transport → RESOLVED. Use function invocation via spawn() (function timeout, not the 150s web limit [D]). Auth = Modal API token on the control plane (D7). §14/§15 reflect this.
  • H1/H2 — GPU/snapshot: narrowed to one validated GPU type (snapshots are alpha + hardware-specific; L40S marginal). See D3/D5.
  • H4 — spillover/back-pressure: designed in §16 (was a checkbox).

Remaining gate: deploy the real image to replace the [E]/[M-platform] figures with real measurements (cold start, per-GPU concurrency, GPU-OOM headroom, $/render).

1. Goal

Use Modal as a burst-overflow rendering tier, not the primary serving layer.

Non-goals: running 24/7 max-throughput on Modal (in-house is cheaper for constant load); training.


2. What Modal actually is (architecture)

Modal is not Kubernetes. It is a custom serverless GPU platform:

| Layer | Implementation |
| --- | --- |
| Orchestration | Custom multi-cloud scheduler in Rust (deliberately not Docker/K8s) |
| Container runtime | Custom Rust runtime, sub-second container start [D] |
| Isolation | gVisor (multi-tenant sandbox; runs gVisor with GPUs, per Modal) |
| Filesystem | ImageFS: lazy-loading, content-addressed, FUSE. Images stream on demand |
| Hardware | Bare-metal GPU instances pooled across clouds (e.g. Oracle OCI bare metal) |

Implications for us:

Why container start is "<1s" (and what it excludes). Modal removes the three things that make a "normal" cold start slow:

| Normal slow step | Typical time [E] | Modal's trick |
| --- | --- | --- |
| Provision a VM / add a K8s node | 30 s – 5 min | Warm bare-metal pool, GPUs pre-attached → ~0 |
| docker pull (extract all layers first) | 30 s to minutes (multi-GB GPU images) | ImageFS lazy mount: start now, stream files on demand |
| Container runtime setup | 0.1 – 1 s | Rust runtime + gVisor, optimized hot path |

The time ranges above and the 2–10 min K8s figure below are [E] typical industry estimates, illustrative only — not measured by us, not published by Modal. They convey the direction (Modal removes node-provision + image-pull waits), not benchmarked values.

The "<1s" Modal start [D] excludes loading our model weights. A GPU pod scaling from zero on K8s is typically 2–10 min [E] end-to-end; Modal collapses that to seconds — but our service cold start is still gated by weight load + engine init (see §5).


3. Our workload facts

| Fact | Value | Source |
| --- | --- | --- |
| Image size (weights + SGLang + CUDA) | ~75 GB | [T] |
| Model | Closed-source, FP8 (no HF id; weights from our private store) | [T] |
| Runtime VRAM | ~40 GB (incl. 16k KV cache) | [T] |
| Engine | SGLang (internal token backbone) | [T] |
| External interface | Async video job → returns MP4 (invoked via spawn()) | resolved (C1/C2) |
| GPU | One GPU with validated VRAM headroom. For the snapshot path, pin a single GPU type (snapshots are hardware-specific — H1). Likely H100/H200; L40S (48 GB) is marginal for 40 GB + overhead — validate before listing (H2). FP8 requires Hopper/Ada/Blackwell (excludes A100). | revised post-review |

Not yet measured: real VRAM-at-runtime per candidate GPU, and whether 40 GB + KV + snapshot overhead actually fits 48 GB. Validate before committing to a GPU.


4. Target architecture

[Figure: §4 Target architecture. Clients (requests) → Gateway (edge, auth) → CPU control-plane server (internal net, Redis + S3; owns API, status, S3 upload, render dispatch + spillover; Redis is internal, S3 is public). While the in-house renderer (baseline, always-on) has capacity, renders go there; when it is saturated, they spill to the Modal renderer (GPU, stateless, burst, scales to 0).]

A pure-CPU control-plane server (co-located with the gateway, inside the internal network with access to internal Redis and public S3) owns API standardization, job status, the MP4 upload, and the spillover decision (§16). The GPU renderer — in-house or Modal — is stateless compute behind it.
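The dispatch rule described above (in-house while it has capacity, Modal once saturated) can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the team's implementation: the `Capacity` fields, both limits, and the `QUEUE` fallback used when both tiers are full are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    IN_HOUSE = "in_house"
    MODAL = "modal"
    QUEUE = "queue"  # back-pressure: leave the job in status "queued"


@dataclass
class Capacity:
    in_house_in_flight: int  # renders currently running in-house
    in_house_limit: int      # hypothetical: max concurrent in-house renders
    modal_in_flight: int     # renders currently spawned on Modal
    modal_limit: int         # hypothetical: cap derived from max_containers


def choose_target(cap: Capacity) -> Target:
    """Prefer the always-on in-house tier; spill to Modal only when it is saturated."""
    if cap.in_house_in_flight < cap.in_house_limit:
        return Target.IN_HOUSE
    if cap.modal_in_flight < cap.modal_limit:
        return Target.MODAL
    return Target.QUEUE  # both tiers full: hold the job rather than overrun the cost cap


print(choose_target(Capacity(8, 8, 0, 10)))  # in-house saturated, Modal has room
```

Keeping the decision a pure function of observed counters means the control plane can unit-test the spillover policy without touching Modal or the in-house fleet.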

4.1 Modal load-balancing mechanics (validated with a stand-in container)

What was tested: a synthetic debian_slim CPU container (FastAPI + time.sleep), not our renderer. These results characterise Modal's scheduler/LB, and say nothing about our image, GPU, cold start, or real concurrency.

A Modal web endpoint is one stable URL, not a per-machine address:

[Figure: §4.1 Modal LB mechanics, [M-platform] (synthetic CPU container, NOT our renderer). Gateway (one URL) → Modal proxy *.modal.run (stateless LB, 0..N containers) → container 1 (AWS, us-east-1), container 2 (Azure, eastus2), container 3 (GCP, europe-west1), … container N; each container renders requests.]
Observed: 15 requests → 9 containers across AWS + Azure + GCP. This demonstrates Modal's scheduler, not our per-GPU concurrency.

4.2 Where load-balancing behavior is defined

Modal owns the routing; you own the knobs — there is no custom routing/LB config to write.

| Concern | Who controls it | How |
| --- | --- | --- |
| Request → container routing | Modal (automatic) | Routes to any container with spare capacity. |
| Requests packed per container before fan-out | You | @modal.concurrent(target_inputs=…, max_inputs=…); target_inputs = autoscaler goal, max_inputs = hard ceiling |
| How many containers exist | You | min_containers, buffer_containers, max_containers, scaledown_window |
| Session affinity / sticky routing | Not available | Routing is stateless. Affinity, if ever needed, lives in our control plane. |
| Region of placement | You (optional) | region= (adds 1.5–1.75× cost [D]; default = unpinned) |

4.3 Scale up/down — [M-platform] synthetic autoscaler test

Synthetic test, not our renderer. Container = debian_slim + time.sleep(3); min_containers=0, scaledown_window=20, @modal.concurrent(max_inputs=1); driven by 12 concurrent test requests.

| Phase | Containers |
| --- | --- |
| At rest | 0 (scaled to zero, $0) |
| Under load (12 test requests) | 12 |
| Idle, within scaledown_window | 12 (held ~20–24 s) |
| Idle +32 s → +40 s | 12 → 1 → 0 |

What this proves: Modal's scale-to-zero, fan-out, and drain mechanics work. What the "12" is NOT: it is 12 test requests × max_inputs=1 (one request pins one container), i.e. the test config, not our renderer's concurrency. With a higher max_inputs (e.g. 32, illustrative), the same 12 requests would fit in about 1 container; the production value is unmeasured, and the §14 skeleton carries a placeholder of 8. This test measures nothing about our image, cold start, or real per-GPU concurrency; those need the real image deployed.
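The relationship between concurrent requests, max_inputs, and container count can be written down as a one-line calculation. This is a simplified steady-state sketch, assuming every container is packed up to max_inputs; Modal's autoscaler also weighs target_inputs and buffer settings, so real counts can differ. The function name is hypothetical.

```python
import math


def containers_needed(concurrent_requests: int, max_inputs: int) -> int:
    """Minimum containers required if each one accepts up to max_inputs requests."""
    if max_inputs < 1:
        raise ValueError("max_inputs must be at least 1")
    return math.ceil(concurrent_requests / max_inputs)


# The synthetic test: 12 requests, one request per container.
print(containers_needed(12, 1))   # 12
# The same load at higher per-container concurrency.
print(containers_needed(12, 8))   # 2
print(containers_needed(12, 32))  # 1
```

This is why the "12" in the table says nothing about production: it falls out of max_inputs=1, and the figure shrinks as soon as a container can hold more than one render.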

4.4 Governing principle — control plane vs render plane

Modal is a stateless render plane (compute only). The pure-CPU control-plane server owns all state — status, lifecycle, Redis, S3.

| Concern | Owner |
| --- | --- |
| API standardization (POST /v1/videos, GET /{id}, GET /{id}/content) | CPU control-plane server |
| Job lifecycle, video_id, status/progress | CPU control-plane server → Redis (internal) |
| Render dispatch + spillover (in-house vs Modal) | CPU control-plane server (§16) |
| MP4 upload to our S3 (+ presigned reads) | CPU control-plane server |
| GPU rendering (frames → MP4) | Renderer (in-house / Modal), stateless compute |

Because the renderer holds no state, the hard distributed-systems questions disappear: "which container has the mp4?" (none — the control plane uploads it to our S3 by video_id), sticky sessions (N/A), keeping replicas warm / region pinning (nothing to keep).

Caveat (C2/[D]): the renderer returns the MP4 as the Modal function result. Modal stages large return values through its own object store transparently — so the bytes do transit Modal storage, even though the renderer never touches our Redis/S3. The "renderer touches nothing internal" claim is true of our network only.


5. Cold-start strategy (the core problem)

Anatomy of our cold start

[Figure: §5 Anatomy of a cold start. container start < 1 s ([D] Modal docs) + weight load ~43 s ([E] extrapolated, generic probe) + engine init / compile 10 s to minutes ([E] estimate, to confirm). A memory snapshot removes the weight-load and init steps on every burst, bringing cold start back to seconds.]
[D] Modal docs · [E] estimate (not measured on the real model). The snapshot is meant to skip weight-load + init on subsequent bursts; realised cold start is unmeasured until the real image runs.
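The source does not spell out how the ~43 s weight-load extrapolation was derived. One reading that reproduces it is the ~40 GB footprint read at the 0.92 GB/s measured in the generic Volume probe. The sketch below works through that arithmetic; treating the payload as 40 GB is an assumption, and the function name is hypothetical.

```python
def weight_load_seconds(size_gb: float, read_gb_per_s: float) -> float:
    """Time to stream size_gb from a Volume at a sustained read rate."""
    if read_gb_per_s <= 0:
        raise ValueError("read rate must be positive")
    return size_gb / read_gb_per_s


# [M-platform] probe rate (3 GB file, CPU box), applied to an assumed 40 GB payload.
print(round(weight_load_seconds(40, 0.92), 1))  # 43.5

# Same payload across the documented 1-2 GB/s Volume range [D].
print(weight_load_seconds(40, 1.0))  # 40.0
print(weight_load_seconds(40, 2.0))  # 20.0
```

Even at the top of the documented range the load step stays in the tens of seconds, which is why the snapshot path, rather than a faster Volume, is the lever that could bring cold start back to seconds.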

Scale-to-zero means the cold start lands exactly when a burst hits. Three layers attack it:

  1. Split the 75 GB image — image = SGLang + CUDA only; weights → Modal Volume (pinned); pre-compiled DeepGEMM kernels → a second Volume.

  2. Memory snapshot of the warm server — documented to give 3–10× [D] faster cold starts.

    • GPU snapshot is an alpha feature [D] and hardware-specific — see H1/D5.
    • @modal.enter(snap=True): start engine, warm it, /release_memory_occupation, snapshot.
    • @modal.enter(snap=False): /resume_memory_occupation on restore.
    • TORCHINDUCTOR_COMPILE_THREADS=1 (else torch.compile breaks snapshots [D]).
    • Warms up over the first ~5 [D] cold starts — per GPU type (H1).
  3. Burst headroom: min_containers=0 + buffer_containers=1 (covers only one concurrent first request, not a burst — see §16); optional predictive pre-warm before known peaks.

Volume read probe — [M-platform], generic, NOT the real model


6. Security & networking ("proxy")

Inbound auth — function invocation (resolved)

Outbound — static egress IP (not needed)


7. Autoscaling knobs (Modal primitives, not K8s)

| Knob | Setting | Purpose |
| --- | --- | --- |
| min_containers | 0 (or small for hot peaks) | Pay nothing idle |
| buffer_containers | 1–2 | Absorb the first spike request(s); limited (§16) |
| max_containers | set a ceiling | Cost cap / blast-radius limit (D10) |
| @modal.concurrent(target_inputs, max_inputs) | tuned to real per-GPU concurrency (measure) | Requests packed per replica ($/render lever) |
| scaledown_window | e.g. 300 s | How long idle replicas linger before →0 |

8. Cost model


9. Go-live checklist

Remaining before sign-off (C1/C2 resolved)

Must-have

Nice / depends


10. Decisions

Set with the team; D3/D5/D7 revised after the design review; C1/C2 resolved.

| # | Decision | Value | Notes |
| --- | --- | --- | --- |
| D1 | Model source | Closed-source — no HF id | Private weights provisioned once into a Modal Volume from our store (pinned). Store creds = a Secret used only by the provisioning job. |
| D2 | Quantization | FP8 (done) | Constrains GPU to FP8-capable hardware (Hopper/Ada/Blackwell; excludes A100). |
| D3 | GPU | One GPU, single validated type (revised) | GPU snapshots are hardware-specific (H1), so a multi-type list fragments them (one snapshot per type, rarely-used types never warm). Pin one type with validated VRAM headroom (likely H100/H200; L40S marginal — validate, H2). Not the earlier open ["H100","H200","L40S"] list. |
| D4 | Context length | 16k | The token backbone's sequence length; sized into the VRAM floor. |
| D5 | Snapshot / failure | GPU snapshot ON (alpha), no CPU fallback (revised) | enable_gpu_snapshot=True is alpha [D]. On snapshot/GPU failure → respawn a fresh container. Accept that the tail (cold, un-snapshotted) pays full cold start. |
| D6 | Warm capacity | min_containers=0, buffer_containers=1 | $0 idle; buffer covers only the first request(s) — burst handling in §16. |
| D7 | Auth | Modal API token on control plane | Render via spawn(): no public endpoint; the control plane authenticates to Modal with an API token (Secret). No public attack surface. |
| D8 | Outbound egress proxy | Not needed | Renderer needs no internal access. |
| D9 | Load distribution | Use Modal's built-in autoscaling + LB | Pods aren't externally addressable; the control plane decides when to spill (§16) and Modal distributes internally. |
| D10 | Cost ceiling | $100k / month | Modal spend limit + alerts; size max_containers (arithmetic TBD once $/render known). |
| D11 | Region | Default (no pin) | Avoids the 1.5–1.75× premium. |
| D12 | Environments | dev + prod, deploy via CI | Pinned image + weights per env. |

Remaining gate: real-image validation (cold start, per-GPU concurrency, OOM, $/render) before final sign-off. C1/C2 resolved.
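Until $/render is known, D10 can only be bounded from the worst case: how many containers the budget would cover if every one ran all month. The sketch below shows that arithmetic. The $4.00 per GPU-hour price is a hypothetical placeholder, not a Modal quote, and the function name and the 730-hour month are assumptions made for illustration.

```python
import math

HOURS_PER_MONTH = 730  # average month: 8760 h / 12


def max_always_on_containers(
    monthly_budget_usd: float,
    gpu_usd_per_hour: float,
    region_multiplier: float = 1.0,
) -> int:
    """Containers the budget covers if each one runs 24/7 for a month."""
    hourly = gpu_usd_per_hour * region_multiplier
    return math.floor(monthly_budget_usd / (hourly * HOURS_PER_MONTH))


PRICE = 4.00  # hypothetical $/GPU-hour, replace with the real rate

print(max_always_on_containers(100_000, PRICE))        # 34 (unpinned, D11)
print(max_always_on_containers(100_000, PRICE, 1.5))   # 22 (1.5x region premium)
print(max_always_on_containers(100_000, PRICE, 1.75))  # 19 (1.75x region premium)
```

Because the tier is burst-only and scales to zero, a max_containers above this figure can still fit the budget; the worst-case number is a floor for the ceiling, and the spend limit and alerts remain the actual backstop.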


11. References


12. Prototype artifacts (this workspace)

In ~/modal_examples/ (workspace bosonai) — all stand-in tests, not the real renderer:


13. Implementation runbook (step by step)

Prereqs: modal CLI authed (workspace bosonai); C1 + C2 resolved.

Step 0 — Secrets: private weight-store creds; Modal API token for the control plane (the function path chosen in C2/D7; proxy-auth tokens would apply only to the web path, which C2 ruled out); app api-key.

Step 1 — Provision weights into the boson-weights Volume from our private store (creds via Secret), pinned to the exact version; volume.commit().

Step 2 — Pre-compile kernels (sglang.compile_deep_gemm) → second Volume (avoids JIT during snapshot).

Step 3 — Build the serving image (engine + CUDA only, NO weights).

Step 4 — First deploy + bake the snapshot — invoke ~5 times per GPU type to bake snapshots.

Step 5 — Measure cold start (real image) — scale to zero, invoke, measure restored cold start. Target a few seconds — unverified until measured.

Step 6 — Wire the control plane — render dispatch via spawn(); spillover + back-pressure (§16).

Step 7 — Guardrails & observability: max_containers; GPU-OOM alert; SLIs (§17).

Step 8 — Promote dev → prod.


14. Reference skeleton — renderer.py (illustrative, NOT tested)

⚠️ Reflects the resolved design: renderer = async video job returning MP4, invoked via spawn() (function timeout, not the 150s web limit), single GPU type, GPU snapshot (alpha). SGLang is the internal token backbone; exact flags/values finalize against the real model.

import os
import subprocess

import modal

MODEL_PATH = "/weights/boson-avatar"     # D1 — closed-source, provisioned into Volume
PORT = 30000

image = (
    modal.Image.from_registry("nvidia/cuda:12.x-devel-ubuntu22.04", add_python="3.12")
    .pip_install("sglang[all]==<pin>")
    .env({"TORCHINDUCTOR_COMPILE_THREADS": "1", "SGLANG_ENABLE_JIT_DEEPGEMM": "1"})
    # .run_function(precompile_deepgemm, volumes={...})   # Step 2
)
weights = modal.Volume.from_name("boson-weights", create_if_missing=True)
kernels = modal.Volume.from_name("sglang-kernels", create_if_missing=True)
app = modal.App("boson-renderer-overflow")

@app.cls(
    image=image,
    gpu="H100",                                       # D3 — ONE validated GPU type (not a list)
    volumes={"/weights": weights, "/kernels": kernels},
    enable_memory_snapshot=True,                       # D5 — alpha; no CPU fallback, respawn on fail
    experimental_options={"enable_gpu_snapshot": True},
    min_containers=0, buffer_containers=1,             # D6
    max_containers=int(os.getenv("MAX_CONTAINERS", "...")),  # D10 — size to $100k/mo; "..." is a placeholder, so set MAX_CONTAINERS or int() raises ValueError
    scaledown_window=300,
    timeout=20 * 60,                                   # function timeout (NOT the 150s web limit)
)
@modal.concurrent(max_inputs=8)                        # MEASURE real per-GPU concurrency first
class Renderer:
    @modal.enter(snap=True)
    def start_and_warm(self):
        cmd = ["python", "-m", "sglang.launch_server", "--model-path", MODEL_PATH,
               "--host", "0.0.0.0", "--port", str(PORT),
               "--quantization", "fp8", "--context-length", "16384"]   # D2/D4
        self.proc = subprocess.Popen(cmd)
        # _wait_until_healthy(); _warmup(); _post("/release_memory_occupation")

    @modal.enter(snap=False)
    def resume(self):
        ...  # _post("/resume_memory_occupation")

    @modal.method()                                    # C2 — called via .spawn(); NOT a web endpoint
    def render(self, req) -> bytes:
        # drive the local engine on PORT, return the MP4 bytes (Modal stages large results in
        # its own object store transparently; we still never touch our Redis/S3 here)
        ...

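Control plane (the CPU control-plane server, using the Modal SDK; assumes import modal and the app deployed under the name used above):

Renderer = modal.Cls.from_name("boson-renderer-overflow", "Renderer")  # look up the deployed class
call = Renderer().render.spawn(req)        # non-blocking; returns a FunctionCall
# store call.object_id with the job; later poll modal.FunctionCall.from_id(call_id).get(timeout=0),
# which raises TimeoutError while pending, or await the result; then upload bytes to OUR S3 + update Redis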

15. Async render flow (maps to Boson video API)

A pure-CPU control-plane server sits between the gateway and the GPU renderers. In the internal network (reaches internal Redis + public S3), it standardizes the Boson video API, tracks status in Redis, dispatches renders, and uploads the MP4 to our S3.

API (unchanged): POST /v1/videos → {id, status:"queued", progress, ...}; GET /v1/videos/{id}; GET /v1/videos/{id}/content. Status ∈ queued|in_progress|completed|failed.
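The four statuses imply a small state machine that the control plane can enforce before writing to Redis. The sketch below is illustrative only: a plain dict stands in for Redis, the `ALLOWED` transition table is an assumption (in particular that a job may fail straight from queued), and the function names are hypothetical.

```python
# Assumed transitions; completed and failed are terminal.
ALLOWED = {
    "queued": {"in_progress", "failed"},
    "in_progress": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def create_job(store: dict, video_id: str) -> None:
    """Register a new job in the initial status."""
    store[video_id] = "queued"


def transition(store: dict, video_id: str, new_status: str) -> None:
    """Move a job to new_status, rejecting moves the lifecycle does not allow."""
    current = store[video_id]
    if new_status not in ALLOWED[current]:
        raise ValueError(f"{video_id}: {current} -> {new_status} is not allowed")
    store[video_id] = new_status


jobs: dict = {}  # stand-in for the internal Redis
create_job(jobs, "vid_1")
transition(jobs, "vid_1", "in_progress")
transition(jobs, "vid_1", "completed")
print(jobs["vid_1"])  # completed
```

Rejecting illegal moves in one place keeps a late or duplicated renderer result from flipping a finished job back to in_progress.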

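[Figure: §15 Async render flow. Lanes: Client · CPU control plane (internal; Redis + S3) · Modal renderer.]

  1. Client → control plane: POST /v1/videos. The control plane creates the video_id, sets Redis = queued, dispatches the render, and replies 200 {id, status:queued}.
  2. Control plane → renderer: render (spill on burst). The renderer turns frames into an MP4 and returns the MP4 bytes; it is stateless and touches no Redis/S3.
  3. Control plane: receives the MP4 bytes, uploads them to S3 under the video_id, and sets Redis = completed.
  4. Client: GET /{id} → status (from Redis); GET /{id}/content → presigned S3 URL.

The CPU control plane owns Redis + S3; the Modal renderer only returns MP4 bytes. The renderer never reaches internal state: no callback, no VPC.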

Why function-spawn() and not a web request (C2): a Modal web endpoint caps a request at 150 s [D] — a multi-minute render can't return bytes synchronously over HTTP. spawn() uses the function timeout (up to hours), and the control plane retrieves the result via FunctionCall. The renderer never touches our Redis/S3; large MP4 returns transit Modal's object store transparently.

Progress granularity (honest): coarse status (queued→in_progress→completed) is free from the spawn lifecycle. Live progress 0–100 would require the renderer to emit progress events to the control plane — an outbound call that reintroduces a renderer→control-plane dependency. So: default to coarse status; only add live progress if the product requires it, and accept the extra channel.


16. Spillover & back-pressure (design)

The core value prop — and the part most easily hand-waved. Concretely:


17. Monitoring & machine health

Do we manage Modal machine health? Almost entirely no — that's the point of serverless. Modal runs continuous GPU health checks across its fleet, drains/replaces bad GPUs, and reschedules containers on failure (gpu-health). No DCGM / node-problem-detector / nvidia-smi babysitting; no nodes to patch. Our D5 "respawn on failure" is largely what Modal already does.

What we DO monitor — at the function/app level:

| Layer | Watch | Where |
| --- | --- | --- |
| Modal-side | GPU memory (OOM!), GPU util/power, container counts, cold-start time, failure/retry rate | Modal dashboard + GPU metrics; abnormal GPU events go to container logs |
| Control plane | render success/fail rate, render latency, queue depth / spillover rate, cold-start p99 | our metrics + Redis (already tracks jobs) |
| Cost | $/day vs the $100k/mo budget | Modal spend limit + billing alerts |
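The "$/day vs the $100k/mo budget" check in the Cost row reduces to a run-rate projection. The sketch below is a minimal illustration, assuming a linear extrapolation from spend so far and a 30-day month; the function names are hypothetical and the spend figures are made-up inputs, not measurements.

```python
MONTHLY_BUDGET_USD = 100_000  # D10


def projected_monthly_spend(
    spend_to_date_usd: float, day_of_month: int, days_in_month: int = 30
) -> float:
    """Linear run-rate projection of month-end spend."""
    if day_of_month < 1:
        raise ValueError("day_of_month must be at least 1")
    return spend_to_date_usd / day_of_month * days_in_month


def over_budget(spend_to_date_usd: float, day_of_month: int) -> bool:
    """True when the current run rate would exceed the monthly budget."""
    return projected_monthly_spend(spend_to_date_usd, day_of_month) > MONTHLY_BUDGET_USD


# Made-up inputs: $40k spent by day 10 projects to $120k for the month.
print(projected_monthly_spend(40_000, 10))  # 120000.0
print(over_budget(40_000, 10))              # True
print(over_budget(30_000, 10))              # False
```

A linear projection overreacts to a single early burst, so it suits an early-warning alert; the Modal spend limit stays the hard stop.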

Bottom line: don't build machine-health monitoring. Put observability in the control plane (render success/latency/spillover/cost) plus a GPU-OOM + failure-rate alert on Modal.