forbidden_mask to
BackwardInduction.run() (blocks non-home HOME states). After that fix, c_change
converged to −0.301. This investigation is specifically about why δ remained near-zero
even after the c_change fix.
--workers-only) is correct methodology but was a minor
secondary factor in v5 (9 persons out of ~600). Fix: zone-agnostic graph construction.
forbidden_mask in BI
BackwardInduction.run() was never passed the
forbidden_mask argument in any pipeline call. Without it, HOME activity was
legal at every zone: agents could make aimless cross-zone HOME loops at zero cost,
producing ~25 simulated trips/day vs ~3 observed. c_change was the only lever to suppress
these HOME loops, so it drifted to −∞.
Fix: pass forbidden_mask (blocks non-home HOME states with V=−∞)
to BI in 4 pipeline locations. After the fix, profile LL has a clean interior argmax and
c_change converged to −0.301 (close to the HH MNL prior of −0.3). ✓ Resolved.
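
The effect of the mask can be illustrated on a toy graph. The sketch below is not the project's BackwardInduction class: the function name, the edge-list layout, and the three-state example are assumptions made for illustration. It only shows how pinning a forbidden state to V=−∞ removes a zero-cost detour from the logsum value at the start state.

```python
import numpy as np

def backward_induction(n_states, edges, utilities, terminal, forbidden_mask=None):
    """Logsum backward induction on a DAG with topologically ordered states.

    edges[i] = (src, dst) with src < dst; utilities[i] is that edge's utility.
    forbidden_mask[s] = True pins V[s] to -inf so no path can pass through s.
    (Hypothetical helper, not the pipeline's actual signature.)
    """
    if forbidden_mask is None:
        forbidden_mask = np.zeros(n_states, dtype=bool)
    V = np.full(n_states, -np.inf)
    V[terminal] = 0.0
    # Visit states from last to first so every successor is already valued.
    for s in range(n_states - 1, -1, -1):
        if forbidden_mask[s]:
            V[s] = -np.inf
            continue
        if s == terminal:
            continue
        vals = [u + V[dst] for (src, dst), u in zip(edges, utilities) if src == s]
        finite = [v for v in vals if np.isfinite(v)]
        if finite:
            V[s] = np.logaddexp.reduce(finite)
    return V

# State 0: start at home. State 1: HOME at a non-home zone. State 2: terminal.
edges = [(0, 2), (0, 1), (1, 2)]
utilities = [-1.0, 0.0, 0.0]  # the detour through state 1 costs nothing

V_free = backward_induction(3, edges, utilities, terminal=2)
V_masked = backward_induction(
    3, edges, utilities, terminal=2,
    forbidden_mask=np.array([False, True, False]),
)
```

Without the mask the free detour inflates V at the start state; with it, only the direct path contributes, so a cost parameter no longer has to absorb the spurious loops.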
--workers-only flag to drop non-workers before grouping.
v6 launched: 2000 persons → 793 workers (734 non-workers removed, 48%), 109 timing-window groups (min-group-size=5),
zone-agnostic graphs, analytical gradient, warm start from v5 checkpoint.
.item() sync point) after only ~1 iteration. Confirmed transient (v7 passed
the same group cleanly). Root cause: RAM pressure from 109 timing-window groups (v5 grouping
reused), each needing a ~22 GB graph. Relaunched as v7 with --skip-phase1 and
consolidated to 29 zone-agnostic groups (home-zone grouping).
--skip-phase1 (29 graphs pre-built on disk),
--workers-only, --analytical-gradient, warm-start from v5 checkpoint.
702 persons, 569 valid LL contributors.
BFGS ran 26 iterations — δ rose to 0.047 (genuine signal from WORK steps ✓).
c_change drifted to −0.756 (monotone across breakthroughs — suspected BFGS overshoot).
Best LL = −7948.54 at iter 26. Killed to avoid RAM OOM when profile scan launched simultaneously.
estimation_results/nfxp_v8_workersonly_20260513.log.
estimation_results/nfxp_checkpoint_20260513_142216.csv.
Persons are grouped by activity type + timing window (28 groups with
--scheduling-preferences --timing-round 60). Each group shares one state-action graph
built by the forward pass. Because graphs are expensive to build (~150s each), they are built once
and reused across all NFXP iterations.
encode_observed_steps)
looks up each step in the graph by hash. When a WORK state at the wrong zone is not found, it
returns None and skips that step — no error, no warning. The LL computation simply
excludes those steps from Uobs.
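
A silent skip like this is easier to catch when the encoder reports what it dropped. The following is a minimal sketch under assumed names: `encode_steps_with_report`, the `(activity, zone, time)` step key, and the dict-based `state_index` are illustrative and not taken from the actual encode_observed_steps implementation.

```python
from collections import Counter

def encode_steps_with_report(observed_steps, state_index):
    """Map observed steps to graph state ids and account for every miss.

    observed_steps: list of hashable (activity, zone, time) keys.
    state_index: dict from step key to state id (stand-in for the hash lookup).
    Returns the matched ids, a per-activity count of misses, and the match rate.
    """
    encoded = []
    missing = Counter()
    for step in observed_steps:
        sid = state_index.get(step)
        if sid is None:
            missing[step[0]] += 1  # count by activity type instead of skipping silently
        else:
            encoded.append(sid)
    match_rate = len(encoded) / len(observed_steps) if observed_steps else 1.0
    return encoded, missing, match_rate

# The graph only has WORK at zone 7, but this person works at zone 12.
state_index = {("HOME", 12, 0): 0, ("WORK", 7, 9): 1, ("HOME", 12, 17): 2}
observed = [("HOME", 12, 0), ("WORK", 12, 9), ("HOME", 12, 17)]

encoded, missing, rate = encode_steps_with_report(observed, state_index)
```

A per-activity miss count makes a pattern such as "every WORK step is unmatched" visible immediately, where an aggregate match rate alone might not.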
Before the fix: Uobs has no WORK terms → δ only appears in V(s0), the denominator. LL always decreases as δ increases → the optimizer pushes δ to the smallest value consistent with the α/β constraints. Result: δ = 0.0408 (near-zero, unstable).
After the fix: Uobs now includes WORK step utilities → δ shifts both Uobs and V(s0). The LL surface has a proper interior maximum in δ, which is genuine identification from observed work episodes.
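
The sign argument can be written out by differentiating the per-person log-likelihood. This is a sketch under stated assumptions, not a derivation taken from the code: it assumes LL is Uobs minus V(s0), that utility is linear in δ through a non-negative WORK feature (written x^W here, a notation not in the original), and that V is a logsum value function, so its derivative is an expectation under the model.

```latex
% Per-person log-likelihood: observed path utility minus the logsum at s_0
\ell_n(\theta) = U^{\text{obs}}_n(\theta) - V(s_{0,n};\theta)

% Derivative in \delta: observed WORK feature minus its model expectation
\frac{\partial \ell_n}{\partial \delta}
  = x^{W}_{n,\text{obs}} - \mathbb{E}_{\theta}\!\left[\, x^{W} \mid s_{0,n} \right]

% Zone bug: WORK steps are dropped from U^{obs}, so x^{W}_{n,obs} = 0 and
\frac{\partial \ell_n}{\partial \delta}
  = -\,\mathbb{E}_{\theta}\!\left[\, x^{W} \mid s_{0,n} \right] \le 0
```

Under these assumptions the buggy objective is non-increasing in δ everywhere, while the corrected one has a zero gradient where the expected WORK feature matches the observed one.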
Despite the bug, L-BFGS-B found a stationary point every time. This is expected — the LL is still smooth and bounded even with WORK steps missing. The optimizer converged to a genuine stationary point of the wrong objective.
forbidden_mask in BI; not affected by zone bug
[(WORK, None)] instead of [(WORK, person_0_zone)]. Generates WORK states at all 144 zones in the forward pass.
None zone in sequence constraint check: (required_zone is None) OR (action.destination_zone == required_zone). Zone=None means any zone is valid.
zone_ok = (required_zones < 0) | (action_zone == required_zones). Fixed for both tensor (GPU) and numpy (CPU) code paths.
row_ptr: zone-agnostic graphs produce ~2.23B edges, exceeding INT32_MAX (2.15B). Auto-select int64 when edge count exceeds 2B.

| Check | Before fix | After fix | Result |
|---|---|---|---|
| WORK states in graph | 7,921 states (1 zone) | 1,148,678 states (144 zones) | PASS |
| Observed step match rate | ~2.5% (WORK steps missing) | 98.6% | PASS |
| Workers with WORK matched | 1 / 99 (Person 0 only) | 3 / 3 (all in test group) | PASS |
| V(s0) finite | Finite but wrong (wrong zones) | Finite for all home zones | PASS |
| Graph size | 5.76M states / 1.81B edges | 7.07M states / 2.23B edges | ~23% larger |
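
The vectorised zone check and the index-width selection listed among the fixes can be sketched in NumPy as follows. The function names and the exact 2B threshold constant are assumptions for illustration; only the `zone_ok` expression and the negative-means-any-zone encoding come from the text above.

```python
import numpy as np

# Conservative switch-over point, taken from the "exceeds 2B" rule in the text.
INT64_THRESHOLD = 2_000_000_000

def zone_ok(required_zones, action_zone):
    """Constraint check where a negative required zone means 'any zone is valid'."""
    required_zones = np.asarray(required_zones)
    return (required_zones < 0) | (action_zone == required_zones)

def row_ptr_dtype(n_edges, threshold=INT64_THRESHOLD):
    """Choose the CSR row_ptr dtype: int32 while offsets fit, int64 beyond the threshold."""
    return np.int64 if n_edges > threshold else np.int32

# Three sequence slots: any zone, zone 3 required, zone 5 required.
mask = zone_ok([-1, 3, 5], np.array([3, 3, 3]))

dtype_before = row_ptr_dtype(1_810_000_000)  # edge count before the fix
dtype_after = row_ptr_dtype(2_230_000_000)   # edge count after the fix
```

Switching below INT32_MAX rather than exactly at it leaves headroom for offsets computed during graph construction.
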

| Run | Persons | Outcome | Zone bug | δ result | c_change |
|---|---|---|---|---|---|
| v4 (Apr 2026) | 1,527 | Crashed (OOM) | Yes | N/A | N/A |
| v5 ★ first convergence; zone bug active | ~600 (reused v4 graphs) | Converged (18 iter); Hessian crashed post-opt | Yes: WORK steps invisible, Uobs δ-invariant | 0.0408 (artifact) | −0.301 (fixed earlier) |
| v6 (May 2026) | 702 workers | Crashed: CUDA OOM at ~1 iter, group 025 | Fixed | N/A (crashed) | N/A |
| v7 ★ first genuine run; zone bug fixed | 702 workers · 569 valid LL | 26 iterations; killed (RAM, profile scan) | Fixed (zone-agnostic) | 0.047 (genuine ✓) | −0.756 (near MLE) |
| v8; warm-start from v7 iter 26 | 702 workers · 569 valid LL | Killed at iter 1 only; host RAM pressure | Fixed | N/A (killed) | N/A |
| v9 ★ current best; warm-start from v7 iter 26 | 702 workers · 569 valid LL | 4 iters · LL=−7590; converging, ‖g‖∞=0.58 | Fixed | 0.043 (iter 4) | −0.770 (iter 4), moving → −0.615 |