Milestone · Boson AI · Modal overflow

Build & validation report

Milestone 12 — Gateway overflow sidecar (dynamic weight) ✅

Goal: make the gateway spill to Modal only when in-house capacity is full, with the overflow sized to Modal's live capacity as reported by the Modal API, so that the gateway's weighted scheduler is driven dynamically. Code: ~/modal_examples/milestones/m12_gateway_sidecar.py.

What was built

A sidecar that, every ~3 s:

  1. Polls ModalFunction.get_current_stats(): num_total_runners, num_running_inputs, backlog (verified real fields in modal 1.5).
  2. Computes live free capacity: (runners × max_inputs − running), i.e. the slots open now, plus a small warm-up allowance so a cold (0-runner) tier can still be woken, without advertising full scale-up capacity it can't instantly serve.
  3. Sets the gateway weight + max_conns for the Modal endpoint via the LB admin API (HAProxy Runtime API / nginx Plus / Envoy EDS):
    • 0 when in-house has room (hysteresis, start 0.85 / stop 0.60) → $0, no overflow
    • small warm-up weight to wake a cold tier
    • ramps with real capacity as Modal scales
    • 0 + shed when Modal is full (runners ≥ max_containers & no slots, or backlog rising)
    • max_conns = max_inputs × max_containers (hard cap)
  4. Fail-safe: after N consecutive poll errors, force weight → 0 (don't route to a tier you can't observe) + exponential backoff.
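
Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the code in m12_gateway_sidecar.py: the names Stats, Limits, free_capacity, is_full, overflow_active and target_weight are hypothetical, the warm-up allowance is assumed to apply only while the tier can still add containers, and the weight is assumed to equal the number of free slots.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Stats:
    # Mirrors the three polled fields named in step 1.
    num_total_runners: int
    num_running_inputs: int
    backlog: int


@dataclass(frozen=True)
class Limits:
    max_inputs: int       # concurrent inputs per container
    max_containers: int   # scale-out ceiling
    warmup_slots: int     # assumed allowance while the tier can still scale


def open_slots(s: Stats, lim: Limits) -> int:
    """Slots that can accept an input right now."""
    return max(0, s.num_total_runners * lim.max_inputs - s.num_running_inputs)


def free_capacity(s: Stats, lim: Limits) -> int:
    """Open slots plus the warm-up allowance (assumption: only below the ceiling)."""
    can_scale = s.num_total_runners < lim.max_containers
    return open_slots(s, lim) + (lim.warmup_slots if can_scale else 0)


def is_full(s: Stats, lim: Limits, prev_backlog: int) -> bool:
    """Full = at the container ceiling with no slots, or the backlog is rising."""
    at_ceiling = s.num_total_runners >= lim.max_containers
    return (at_ceiling and open_slots(s, lim) == 0) or s.backlog > prev_backlog


def overflow_active(was_active: bool, inhouse_util: float,
                    start: float = 0.85, stop: float = 0.60) -> bool:
    """Hysteresis: switch on at `start`, switch off only at or below `stop`."""
    if was_active:
        return inhouse_util > stop
    return inhouse_util >= start


def target_weight(s: Stats, lim: Limits, prev_backlog: int,
                  active: bool) -> Tuple[float, int]:
    """Return (weight, max_conns) for the Modal endpoint."""
    max_conns = lim.max_inputs * lim.max_containers  # hard cap
    if not active or is_full(s, lim, prev_backlog):
        return 0.0, max_conns
    return float(min(free_capacity(s, lim), max_conns)), max_conns
```

With max_inputs=4, max_containers=3 and warmup_slots=4, a cold tier advertises 4 slots, a tier with 2 runners and 5 running inputs advertises 7, and a tier at the ceiling with all 12 slots busy yields weight 0.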

How it was validated (python3 m12_gateway_sidecar.py)

ok: capacity cold(warmup)/partial/ceiling/backlog correct
ok: hysteresis + capacity-ramped weight; cold tier woken without flooding
ok: weight snaps to 0 on drain (no stuck 0.1 sliver)
ok: Modal full -> weight 0 + shed
ok: run_loop [0.0, 9.0, 0.0]; fail-safe drops to 0 on API outage
VALIDATION PASSED
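
The fail-safe behaviour exercised by the last check can be illustrated with a small loop. This is a sketch under stated assumptions: failsafe_loop, its parameters, the failure threshold of 3 and the 60 s backoff cap are hypothetical and are not taken from m12_gateway_sidecar.py. Here poll stands for "fetch stats and compute the target weight" and set_weight stands for the LB admin call.

```python
import time
from typing import Callable, List


def failsafe_loop(
    poll: Callable[[], float],
    set_weight: Callable[[float], None],
    *,
    iterations: int,
    max_failures: int = 3,     # hypothetical N
    base_delay: float = 3.0,   # the ~3 s poll interval
    max_delay: float = 60.0,   # hypothetical backoff cap
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Poll and apply the weight; force weight 0 after repeated poll errors."""
    applied: List[float] = []
    failures = 0
    for _ in range(iterations):
        delay = base_delay
        try:
            weight = poll()
            failures = 0
        except Exception:
            failures += 1
            # Double the delay per consecutive failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** failures)
            if failures < max_failures:
                # Below the threshold: leave the last applied weight in place.
                sleep(delay)
                continue
            # Threshold reached: do not route to a tier we cannot observe.
            weight = 0.0
        set_weight(weight)
        applied.append(weight)
        sleep(delay)
    return applied
```

Passing a fake poll function and a list's append method as sleep lets the loop be tested without waiting or touching a real load balancer.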

Code review (separate subagent) — CHANGES NEEDED → fixed

Finding (severity) | Fix applied
Relied on input_headroom (reviewer: nonexistent) | Re-verified it does exist in modal 1.5, but switched to unambiguous num_running_inputs (free = runners × max_inputs − running) to sidestep undocumented semantics
Cold-start over-promise (HIGH): advertised full scale-up capacity into a cold tier | Advertise free-now + small warm-up only; weight ramps as real runners materialize
Stuck residual weight 0.1 (EWMA never hit 0) | Snap to 0 when target is 0, so no lingering overflow/cost
backlog ignored for saturation | Folded into the full check
No fail-safe on API outage | Force weight 0 + shed after N failures + backoff
Tests asserted impl back to itself | Real-shape fake, drain-to-zero test, fail-safe test added
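
The residual-weight finding follows from how an exponentially weighted moving average behaves: with a target of 0 it halves (for alpha = 0.5) on every step but never reaches 0. A minimal sketch of the snap-to-0 fix, where the function names and the alpha value are hypothetical rather than taken from the sidecar:

```python
def plain_ewma(previous: float, target: float, alpha: float = 0.5) -> float:
    """Standard EWMA step: decays toward the target but never lands on it."""
    return alpha * target + (1.0 - alpha) * previous


def smooth_weight(previous: float, target: float, alpha: float = 0.5) -> float:
    """EWMA step that snaps to 0 as soon as the target is 0."""
    if target <= 0.0:
        return 0.0
    return plain_ewma(previous, target, alpha)
```

Smoothing still applies while the weight ramps up or moves between positive targets; only the drain to 0 is made immediate, so no sliver of traffic keeps a paid tier awake.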

Note: the reviewer's one Critical (a "nonexistent field") was itself wrong — I verified against the installed SDK. The other findings were valid and are fixed.

Defaults (0.85/0.60 hysteresis, ~3 s poll, warm-up = one container) are illustrative — chosen, not measured; tune with real traffic. The test outputs ([0.0, 9.0, 0.0], etc.) are real (python3 m12_gateway_sidecar.py). FunctionStats fields were confirmed live via the SDK.

Status: ✅ validated — dynamic, capacity-aware overflow weighting with cold-start wake, snap-to-0, and fail-safe. See §16.1 for how it fits the weighted gateway.