Milestone 10 — Monitoring + alerts ✅
Goal: function/app-level alerting (Modal owns hardware health). Code: ~/modal_examples/milestones/m10_monitoring.py.
What was built
An alert evaluator over a metrics snapshot. Modal owns machine/GPU hardware health (it drains/replaces bad GPUs), so we monitor at the app level:
- GPU memory ≥ 95% → CRITICAL (OOM — the #1 Modal-side risk given a marginal card).
- Render failure rate, cold-start p99, shed-rate, queue depth → WARNING.
- Budget pace, normalized by fraction-of-day elapsed → WARNING.
- Missing / stale metrics → WARNING — a blind pipeline is not "healthy."
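The rules above can be sketched roughly as follows. This is a minimal illustration, not the contents of m10_monitoring.py: the metric key names, the WARNING limits, and the staleness window are all assumptions, and only the 0.95 GPU-memory boundary comes from these notes. Budget pace is left out to keep the sketch short.

```python
import time

GPU_MEM_CRITICAL = 0.95  # from the notes: >= 0.95 fires, 0.949 does not
# Assumed WARNING limits; the real values are not given in these notes.
WARN_LIMITS = {
    "render_failure_rate": 0.05,
    "cold_start_p99_s": 30.0,
    "shed_rate": 0.01,
    "queue_depth": 100,
}
MAX_AGE_S = 120.0  # assumed staleness window, in seconds
REQUIRED_KEYS = ("gpu_mem_frac", *WARN_LIMITS)


def evaluate(snapshot, now=None):
    """Return a list of (severity, message) alerts for one metrics snapshot."""
    now = time.time() if now is None else now
    alerts = []

    # A missing or old timestamp means we are blind, which is not healthy.
    ts = snapshot.get("timestamp")
    if ts is None:
        alerts.append(("WARNING", "timestamp missing"))
    elif now - ts > MAX_AGE_S:
        alerts.append(("WARNING", f"metrics stale by {now - ts:.0f}s"))

    # Report absent (or None) metrics explicitly instead of defaulting to 0.
    for key in REQUIRED_KEYS:
        if snapshot.get(key) is None:
            alerts.append(("WARNING", f"{key} missing"))

    # The only CRITICAL rule: GPU memory at or above the boundary.
    gpu = snapshot.get("gpu_mem_frac")
    if gpu is not None and gpu >= GPU_MEM_CRITICAL:
        alerts.append(("CRITICAL", f"gpu_mem_frac={gpu} >= {GPU_MEM_CRITICAL}"))

    # Everything else is a WARNING when it exceeds its assumed limit.
    for key, limit in WARN_LIMITS.items():
        value = snapshot.get(key)
        if value is not None and value > limit:
            alerts.append(("WARNING", f"{key}={value} > {limit}"))

    return alerts
```

Under these assumptions, evaluate({}) returns only WARNING entries (one for the timestamp and one per required metric) rather than an empty list, which is the "not silent-healthy" behavior described above.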
How it was validated (python3 m10_monitoring.py)
ok: healthy -> no alerts
ok: missing metrics -> WARNING (not silent-healthy)
ok: stale metrics -> WARNING
ok: GPU mem boundary exact (>=0.95 fires, 0.949 does not)
ok: failures / cold-start / shed-rate / queue-depth each fire
ok: budget pace normalized by fraction-of-day
ok: all-bad snapshot fires every rule
VALIDATION PASSED
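For illustration, the exact-boundary check could look something like the sketch below. The helper name gpu_mem_fires and the table of cases are hypothetical, not taken from m10_monitoring.py; the point is only that the value sitting exactly on the threshold is tested alongside its nearest neighbor below it.

```python
def gpu_mem_fires(frac, threshold=0.95):
    """Hypothetical rule: fire when GPU memory fraction is at or above threshold."""
    return frac >= threshold


# (input fraction, expected to fire?) pairs, including the exact boundary.
CASES = [
    (0.50, False),
    (0.949, False),  # just under the boundary: must stay quiet
    (0.95, True),    # exactly on the boundary: must fire (>=, not >)
    (1.00, True),
]

for frac, expected in CASES:
    assert gpu_mem_fires(frac) is expected, f"boundary check failed at {frac}"
print("ok: GPU mem boundary exact")
```

Testing both sides of the boundary is what catches an accidental > in place of >=.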
Code review (separate subagent) — CHANGES NEEDED → fixed
| Finding (severity) | Fix applied |
|---|---|
| .get(metric, 0) made a missing/broken metric read as healthy (HIGH), the classic monitoring anti-pattern | Added missing-key + staleness detection that fires WARNING; evaluate({}) is no longer silent |
| Budget projection ignored time-of-day (MED) | Normalized by day_fraction elapsed |
| Missing leading-indicator alerts (MED) | Added shed-rate and queue-depth alerts |
| Tests skipped boundaries + missing-key (MED) | Added exact-boundary tests (0.95 fires, 0.949 doesn't) and a missing-key test |
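The budget-pace fix in the table can be pictured with a small sketch: project the end-of-day spend by dividing spend so far by the fraction of the day elapsed, then compare the projection to the daily budget. The function names, the min_fraction floor, and the use of local wall-clock time are assumptions made for illustration; the notes only say the projection is normalized by day_fraction.

```python
from datetime import datetime


def day_fraction(now):
    """Fraction of the current day elapsed, in [0, 1)."""
    seconds = now.hour * 3600 + now.minute * 60 + now.second
    return seconds / 86400


def budget_pace_alert(spent, daily_budget, fraction, min_fraction=0.05):
    """True when spend so far, projected to a full day, exceeds the budget.

    min_fraction is an assumed floor that avoids dividing by ~0 (and wildly
    noisy projections) in the first minutes of the day.
    """
    projected = spent / max(fraction, min_fraction)
    return projected > daily_budget


# Same spend, different time of day: only the earlier reading is over pace.
noon = day_fraction(datetime(2024, 1, 1, 12, 0, 0))
evening = day_fraction(datetime(2024, 1, 1, 18, 0, 0))
print(budget_pace_alert(6.0, 10.0, noon))     # projects to 12.0 for the day
print(budget_pace_alert(6.0, 10.0, evening))  # projects to 8.0 for the day
```

Without the normalization, a fixed spend threshold either fires too late in the morning or too eagerly in the evening, which is the gap the review flagged.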
Status: ✅ validated across healthy, missing/stale, every failure mode, and exact boundaries.