Milestone 7 — Cold-start / per-GPU concurrency measured ✅ (stand-in numbers)
Goal: a re-runnable harness that measures the renderer's cold start, warm latency, and burst throughput. Code: ~/modal_examples/milestones/m7_measure.py.
What was built
A harness that, against the deployed M6 renderer:
- forces scale-to-zero (sleeps past
scaledown_window), then times a cold call, - times a warm call,
- fires an 8-spawn burst (Cls cap
max_inputs=4) for a throughput smoke test.
The same script runs verbatim against the real renderer once deployed — the method is final, only the numbers change.
How it was validated (ran against deployed M6)
COLD call: 10.71s
WARM call: 0.35s
BURST(8, cap=4): 3.92s, all_ok=True
RESULT: {standin: True, model: SmolLM2-135M, cold_s: 10.71, warm_s: 0.35, speedup: 30.6x}
VALIDATION PASSED: real cold start confirmed
⚠️ Stand-in numbers (SmolLM2-135M, not the real 75 GB FP8 model). Across two runs cold start varied 20.35s → 10.71s — exactly why real-model numbers must be measured on the real image, not extrapolated.
Code review (separate subagent) — CHANGES NEEDED → fixed
| Finding | Fix applied |
|---|---|
| Assertion passed on any positive timing — a surviving-warm container would be silently recorded as "cold" | Added a cold-start sanity floor: cold_s > 2 and cold_s > 3×warm_s (fails if the container didn't actually scale to zero) |
RESULT dict had no stand-in marker → numbers could be lifted as real |
Added standin: True + model to the machine-readable result |
| 8-spawn "concurrency" conflates in-container concurrency (cap 4) with a cold scale-up | Re-framed as a burst throughput smoke test (labeled BURST(8, cap=4)), not pure 8-wide concurrency |
cold/warm division could divide by zero |
Guarded (None if warm_s==0) |
Status: ✅ method validated; stand-in numbers. Real cold-start / per-GPU concurrency = the pending gate.