Milestone · Boson AI · Modal overflow

Build & validation report

Milestone 10 — Monitoring + alerts ✅

Goal: function/app-level alerting (Modal owns hardware health). Code: ~/modal_examples/milestones/m10_monitoring.py.

What was built

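An alert evaluator that runs over a metrics snapshot. Modal owns machine and GPU hardware health (it drains and replaces bad GPUs), so monitoring here stays at the app level: function failures, cold starts, shed rate, queue depth, GPU memory, budget pace, and missing or stale metrics.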
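
As an illustration only, an evaluator of this kind might look like the sketch below. It is not the code in m10_monitoring.py: the metric names, the `RULES` table, the `ALERT` severity label, and the 120-second staleness window are all assumptions, and only the 0.95 GPU-memory threshold and the WARNING severity for missing or stale metrics come from this report.

```python
import time

# Hypothetical rule table: metric name -> (threshold, message).
# Only the 0.95 GPU-memory threshold is taken from this report;
# the other names and numbers are placeholders.
RULES = {
    "gpu_mem_frac": (0.95, "GPU memory at or above 95%"),
    "failure_count": (1, "function failures observed"),
    "queue_depth": (100, "queue depth high"),
}
MAX_AGE_S = 120  # assumed staleness window, in seconds


def evaluate(snapshot, now=None):
    """Return (severity, message) alerts for one metrics snapshot."""
    now = time.time() if now is None else now
    alerts = []

    # Staleness: a snapshot with no timestamp, or an old one, is a WARNING.
    ts = snapshot.get("timestamp")
    if ts is None:
        alerts.append(("WARNING", "metric missing: timestamp"))
    elif now - ts > MAX_AGE_S:
        alerts.append(("WARNING", f"metrics stale: {now - ts:.0f}s old"))

    for name, (threshold, message) in RULES.items():
        if name not in snapshot:
            # Report the gap instead of defaulting to a healthy value.
            alerts.append(("WARNING", f"metric missing: {name}"))
        elif snapshot[name] >= threshold:
            alerts.append(("ALERT", message))
    return alerts
```

With this shape, `evaluate({})` returns one WARNING per expected metric rather than an empty list, so an absent metrics feed cannot pass as healthy.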

How it was validated (python3 m10_monitoring.py)

ok: healthy -> no alerts
ok: missing metrics -> WARNING (not silent-healthy)
ok: stale metrics -> WARNING
ok: GPU mem boundary exact (>=0.95 fires, 0.949 does not)
ok: failures / cold-start / shed-rate / queue-depth each fire
ok: budget pace normalized by fraction-of-day
ok: all-bad snapshot fires every rule
VALIDATION PASSED
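
For readers who want to reproduce the style of the boundary check in the log above, a small self-check could be sketched as follows. The function names `gpu_mem_fires` and `run_checks` are hypothetical and the harness is an assumption about how such a script might be laid out, not the actual contents of m10_monitoring.py. The 0.95 and 0.949 values are the ones quoted in the log.

```python
GPU_MEM_THRESHOLD = 0.95  # value quoted in the validation log


def gpu_mem_fires(frac):
    """Hypothetical rule: fire when GPU memory use is at or above the threshold."""
    return frac >= GPU_MEM_THRESHOLD


def run_checks():
    """Run each named check, print one 'ok' line per pass, and return the count."""
    checks = [
        ("GPU mem at 0.95 fires", gpu_mem_fires(0.95) is True),
        ("GPU mem at 0.949 does not fire", gpu_mem_fires(0.949) is False),
    ]
    for name, passed in checks:
        if not passed:
            raise AssertionError(f"FAILED: {name}")
        print(f"ok: {name}")
    print("VALIDATION PASSED")
    return len(checks)


if __name__ == "__main__":
    run_checks()
```

Testing both sides of the threshold guards against an accidental switch from `>=` to `>`, which a test using only a clearly unhealthy value such as 0.99 would not catch.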

Code review (separate subagent) — CHANGES NEEDED → fixed

| Finding (severity) | Fix applied |
| --- | --- |
| .get(metric, 0) made a missing or broken metric read as healthy, the classic monitoring anti-pattern (HIGH) | Added missing-key and staleness detection that fires WARNING; evaluate({}) is no longer silent |
| Budget projection ignored time-of-day (MED) | Normalized by day_fraction elapsed |
| Missing leading-indicator alerts (MED) | Added shed-rate and queue-depth alerts |
| Tests skipped boundaries and missing-key (MED) | Added exact-boundary tests (0.95 fires, 0.949 doesn't) and a missing-key test |
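
The budget-pace fix can be illustrated with a short sketch. It assumes the simplest form of normalization: spend so far divided by the fraction of the day elapsed gives a projected full-day spend, which is then compared against the daily budget. The function names, the strict `>` comparison, and the dollar figures are assumptions for illustration; only the idea of normalizing by `day_fraction` comes from the review table.

```python
def projected_daily_spend(spend_so_far, day_fraction):
    """Extrapolate spend so far to a full day, given the fraction of the day elapsed."""
    if not 0 < day_fraction <= 1:
        raise ValueError("day_fraction must be in (0, 1]")
    return spend_so_far / day_fraction


def budget_alert(spend_so_far, day_fraction, daily_budget):
    """Hypothetical rule: fire when the projected full-day spend exceeds the budget."""
    return projected_daily_spend(spend_so_far, day_fraction) > daily_budget


# $30 spent a quarter of the way through the day projects to $120.
early = budget_alert(30.0, 0.25, 100.0)
# The same $30 three quarters of the way through projects to $40.
late = budget_alert(30.0, 0.75, 100.0)
```

Here `early` is `True` and `late` is `False`: the same raw spend is alarming at 06:00 and unremarkable at 18:00, which a comparison that ignores time of day cannot distinguish.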

Status: ✅ validated across healthy, missing/stale, every failure mode, and exact boundaries.