Milestone · Boson AI · Modal overflow

Build & validation report

← Back to plan & tracker

Milestone 7 — Cold-start / per-GPU concurrency measured ✅ (stand-in numbers)

Goal: a re-runnable harness that measures the renderer's cold start, warm latency, and burst throughput. Code: ~/modal_examples/milestones/m7_measure.py.

What was built

A harness that, against the deployed M6 renderer:

  1. forces scale-to-zero (sleeps past scaledown_window), then times a cold call,
  2. times a warm call,
  3. fires an 8-spawn burst (Cls cap max_inputs=4) for a throughput smoke test.

The same script runs verbatim against the real renderer once deployed — the method is final, only the numbers change.

How it was validated (ran against deployed M6)

COLD  call: 10.71s
WARM  call: 0.35s
BURST(8, cap=4): 3.92s, all_ok=True
RESULT: {standin: True, model: SmolLM2-135M, cold_s: 10.71, warm_s: 0.35, speedup: 30.6x}
VALIDATION PASSED: real cold start confirmed

⚠️ Stand-in numbers (SmolLM2-135M, not the real 75 GB FP8 model). Across two runs cold start varied 20.35s → 10.71s — exactly why real-model numbers must be measured on the real image, not extrapolated.

Code review (separate subagent) — CHANGES NEEDED → fixed

Finding Fix applied
Assertion passed on any positive timing — a surviving-warm container would be silently recorded as "cold" Added a cold-start sanity floor: cold_s > 2 and cold_s > 3×warm_s (fails if the container didn't actually scale to zero)
RESULT dict had no stand-in marker → numbers could be lifted as real Added standin: True + model to the machine-readable result
8-spawn "concurrency" conflates in-container concurrency (cap 4) with a cold scale-up Re-framed as a burst throughput smoke test (labeled BURST(8, cap=4)), not pure 8-wide concurrency
cold/warm division could divide by zero Guarded (None if warm_s==0)

Status: ✅ method validated; stand-in numbers. Real cold-start / per-GPU concurrency = the pending gate.