Milestone 10 — Monitoring + alerts ✅

Goal: function/app-level alerting (Modal owns hardware health). Code: ~/modal_examples/milestones/m10_monitoring.py.

What was built

An alert evaluator over a metrics snapshot. Modal owns machine/GPU hardware health (it drains/replaces bad GPUs), so we monitor at the app level:

GPU memory ≥ 95% → CRITICAL (OOM — the #1 Modal-side risk given a marginal card).
Render failure rate, cold-start p99, shed-rate, queue depth → WARNING.
Budget pace, normalized by the fraction of the day elapsed → WARNING.
Missing / stale metrics → WARNING — a blind pipeline is not "healthy."
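
The rules above can be sketched roughly as follows. This is a minimal illustration, not the contents of m10_monitoring.py: only the 0.95 GPU-memory threshold comes from this report, while the metric key names, the other thresholds, and the 120-second staleness cutoff are placeholder assumptions. Budget pace is left out for brevity.

```python
import time
from typing import Optional

# Only the 0.95 GPU-memory threshold comes from the report; every other
# threshold and metric name below is a placeholder assumption.
GPU_MEM_CRITICAL = 0.95
WARNING_THRESHOLDS = {
    "render_failure_rate": 0.05,
    "cold_start_p99_s": 30.0,
    "shed_rate": 0.01,
    "queue_depth": 100,
}
MAX_AGE_S = 120.0  # assumed staleness cutoff, in seconds


def evaluate(snapshot: dict, now: Optional[float] = None) -> list:
    """Return a list of (severity, message) alerts for one metrics snapshot."""
    now = time.time() if now is None else now
    alerts = []

    # Staleness: a snapshot with no timestamp, or an old one, is not trusted.
    ts = snapshot.get("timestamp")
    if ts is None or now - ts > MAX_AGE_S:
        alerts.append(("WARNING", "metrics missing timestamp or stale"))

    # GPU memory is the only CRITICAL rule. A missing key warns instead of
    # silently defaulting to 0 and reading as healthy.
    gpu = snapshot.get("gpu_mem_frac")
    if gpu is None:
        alerts.append(("WARNING", "missing metric: gpu_mem_frac"))
    elif gpu >= GPU_MEM_CRITICAL:
        alerts.append(("CRITICAL", f"gpu_mem_frac={gpu:.3f}"))

    # Remaining rules: each fires a WARNING when absent or over its limit.
    for name, limit in WARNING_THRESHOLDS.items():
        value = snapshot.get(name)
        if value is None:
            alerts.append(("WARNING", f"missing metric: {name}"))
        elif value > limit:
            alerts.append(("WARNING", f"{name}={value} exceeds {limit}"))
    return alerts
```

In this sketch `evaluate({})` returns only WARNING alerts, one per missing input, so an empty snapshot can never pass as healthy.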

How it was validated (`python3 m10_monitoring.py`)

ok: healthy -> no alerts
ok: missing metrics -> WARNING (not silent-healthy)
ok: stale metrics -> WARNING
ok: GPU mem boundary exact (>=0.95 fires, 0.949 does not)
ok: failures / cold-start / shed-rate / queue-depth each fire
ok: budget pace normalized by fraction-of-day
ok: all-bad snapshot fires every rule
VALIDATION PASSED
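
A check in the style of the output above might look like the following sketch. `gpu_mem_alert` and `check` are hypothetical stand-ins rather than the real functions in m10_monitoring.py; the point is only to show the boundary being tested on both sides of 0.95.

```python
GPU_MEM_CRITICAL = 0.95  # threshold taken from the report


def gpu_mem_alert(frac):
    """Hypothetical stand-in for the GPU-memory rule: CRITICAL at or above the threshold."""
    return "CRITICAL" if frac >= GPU_MEM_CRITICAL else None


def check(label, condition):
    """Print an ok line for a passing check, or raise AssertionError."""
    assert condition, f"FAILED: {label}"
    print(f"ok: {label}")


# Test the exact boundary and the nearest value below it, not just
# comfortable values such as 0.50 and 0.99.
check("GPU mem boundary exact (>=0.95 fires, 0.949 does not)",
      gpu_mem_alert(0.95) == "CRITICAL" and gpu_mem_alert(0.949) is None)
check("well below threshold -> no alert", gpu_mem_alert(0.50) is None)
print("VALIDATION PASSED")
```

Testing both sides of the threshold catches the common slip of writing `>` where `>=` was intended, which values far from the boundary would never reveal.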

Code review (separate subagent) — CHANGES NEEDED → fixed

| Finding (severity) | Fix applied |
| --- | --- |
| `.get(metric, 0)` made a missing/broken metric read as healthy, the classic monitoring anti-pattern (HIGH) | Added missing-key + staleness detection that fires WARNING; `evaluate({})` is no longer silent |
| Budget projection ignored time-of-day (MED) | Normalized by `day_fraction` elapsed |
| Missing leading-indicator alerts (MED) | Added shed-rate and queue-depth alerts |
| Tests skipped boundaries + missing-key (MED) | Added exact-boundary tests (`0.95` fires, `0.949` doesn't) and a missing-key test |
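
To illustrate the budget-pace fix, here is a minimal sketch of normalizing spend by `day_fraction`. The function name, its parameters, and the dollar figures are assumptions for illustration; the report only states that the projection is normalized by the fraction of the day elapsed, not the exact formula.

```python
def budget_pace_alert(spend_so_far, daily_budget, day_fraction):
    """Return a WARNING string if projected end-of-day spend exceeds the budget.

    day_fraction is the share of the day elapsed, in (0, 1].
    All names here are hypothetical, not taken from m10_monitoring.py.
    """
    if not 0.0 < day_fraction <= 1.0:
        # Cannot project from an invalid fraction; warn rather than divide by zero.
        return "WARNING: day_fraction out of range, cannot project spend"
    # Linear projection: assume the rest of the day spends at the same pace.
    projected = spend_so_far / day_fraction
    if projected > daily_budget:
        return (f"WARNING: projected ${projected:.2f} exceeds "
                f"daily budget ${daily_budget:.2f}")
    return None


# $30 spent a quarter of the way through the day projects to $120,
# which is over a $100 budget even though $30 < $100.
print(budget_pace_alert(30.0, 100.0, 0.25))
# The same $30 at midday projects to $60 and stays quiet.
print(budget_pace_alert(30.0, 100.0, 0.5))
```

A check that compares raw spend to the daily budget would stay silent in the first case until most of the budget was already gone, which is the time-of-day blindness the review flagged.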

Status: ✅ validated across healthy, missing/stale, every failure mode, and exact boundaries.
