DDCM Estimation — Root Cause Investigation

Why did δ converge to near-zero? Root cause: a zone-matching bug made WORK utility invisible in Uobs. A workers-only sample was also adopted as the correct methodology. (The c_change issue was separate and had already been resolved.)
Higashihiroshima DDCM · K=10 NFXP · Investigation period: 2026-04-20 → 2026-05-13
✓ Both bugs fixed · v7 complete · v9 running — iter 4 best LL=−7590.01 (Δ+358 from v7)

0 Executive Summary (Meeting Brief)

Symptom: δ → 0.0408. Work utility near-zero at MLE; μhome > δ → agents prefer home over work.
Workers affected: 701 / 702. Only 1 worker had WORK steps matched in v5 estimation.
Uobs WORK contribution: 0%. WORK steps silently dropped from all observed paths.
After fix (test group): 98.6% step match rate; WORK states at all 144 zones.
Root cause 1: zone bug in graph builder. Graph built with Person 0's zone only; WORK steps of all other workers dropped.
Secondary issue: non-workers excluded (methodology). 9 non-workers in v5 had minor impact; zone bug alone was sufficient. Workers-only is correct by design regardless.
Re-estimation: v9 iter 4 ✓. Best LL=−7590.01 · δ=0.043 · c_change=−0.770 · gradient ‖g‖∞=0.580 · converging.
Note on c_change: The earlier c_change → −∞ problem was a separate issue, already fixed before this investigation by adding forbidden_mask to BackwardInduction.run() (blocks non-home HOME states). After that fix, c_change converged to −0.301. This investigation is specifically about why δ remained near-zero even after the c_change fix.
Bottom line: The only converged prior estimate is v5 (δ=0.0408), and it is not valid. The primary cause is the zone bug: WORK states existed only for Person 0's zone, making Uobs completely δ-invariant for 701 of 702 workers. Non-worker exclusion (--workers-only) is correct methodology but was a minor secondary factor in v5 (9 persons out of ~600). Fix: zone-agnostic graph construction.

1 Investigation Timeline

2026-04-20 (earlier)
✓ Resolved: c_change → −∞ fixed by forbidden_mask in BI
Root cause: BackwardInduction.run() was never passed the forbidden_mask argument in any pipeline call. Without it, HOME activity was legal at every zone — agents could do aimless cross-zone HOME loops at zero cost, producing ~25 simulated trips/day vs ~3 observed. c_change was the only lever to suppress these HOME loops → drifted to −∞.
Fix: Pass forbidden_mask (blocks non-home HOME states with V=−∞) to BI in 4 pipeline locations. After fix, profile LL has a clean interior argmax and c_change converged to −0.301 (close to HH MNL prior of −0.3). ✓ Resolved.
This is a separate, earlier issue — NOT the subject of this investigation.
2026-04-20
New anomaly: δ still near-zero (0.0408) after c_change fix
After the forbidden_mask fix, c_change converged to −0.301 (sensible). But δ remained at 0.0408 — WORK marginal utility near-zero, meaning μhome > δ and agents prefer staying home over going to work. This is the bug this investigation addresses.
2026-04-22 – 04-30
Tier B1 tests and profile LL scan — δ investigation begins
Three choice-set constraint tests all failed to fix δ identification. A profile LL scan on group 008 (44 workers) then revealed that Uobs is completely invariant to δ: changing δ from 0.01 to 0.14 leaves Uobs unchanged. See profile_ll_delta_muhome.html.
2026-05-12
Root cause found: zone-matching bug in graph construction
Inspection of group 008's saved graph metadata: 7,921 WORK states, all in CZONE_106 (Person 0's zone). 98 of 99 workers work at different zones → their WORK steps never match the graph → silently dropped from Uobs. Bug present in all 27 worker groups across all prior estimation runs.
2026-05-12
Fix designed and implemented (zone-agnostic mandatory sequence)
Build the shared group graph with mandatory_sequence = [(WORK, None)] so that WORK states are generated at all zones. The zone-specific constraint is replaced by a zone-agnostic one (any zone is valid). This is semantically sound because WORK utility in K=10 is purely time-based; zone enters only through travel cost, which is already priced into TRAVEL edges.
2026-05-12
Zone fix tested and validated
Test script on 15 workers (3 unique work zones): 1,148,678 WORK states across 144 zones (PASS), 98.6% step match rate (PASS), V(s0) finite for all home zones (PASS).
2026-05-12
Methodology fix: exclude non-workers from estimation sample
Non-workers' V(s0) includes hypothetical WORK paths (δ affects it), but their Uobs has no WORK steps → LLnon-worker is monotone decreasing in δ. In v5, only 9 non-workers were present out of ~600 persons — minor numerical impact, with the zone bug being the dominant suppressor. Still, excluding non-workers is the correct approach by study design (matching Vastberg 2020): we are estimating work-activity preferences, so the sample should only include persons who actually work.
2026-05-12
Fix 2: workers-only sample · v6 launched
Added --workers-only flag to drop non-workers before grouping. v6 launched: 2000 persons → 793 workers (734 non-workers removed, 48%), 109 timing-window groups (min-group-size=5), zone-agnostic graphs, analytical gradient, warm start from v5 checkpoint.
2026-05-12
v6 crashed — transient CUDA error at group 025
OOM / CUDA device-side assert at group_025 Zone 18 stats loop (.item() sync point) after only ~1 iteration. Confirmed transient (v7 passed the same group cleanly). Root cause: RAM pressure from 109 timing-window groups (v5 grouping reused), each needing ~22 GB graph. Relaunched as v7 with --skip-phase1 and consolidated to 29 zone-agnostic groups (home-zone grouping).
2026-05-12 → 2026-05-13
v7 launched — first genuine estimation run (26 iterations)
Re-launched with --skip-phase1 (29 graphs pre-built on disk), --workers-only, --analytical-gradient, warm-start from v5 checkpoint. 702 persons, 569 valid LL contributors. BFGS ran 26 iterations — δ rose to 0.047 (genuine signal from WORK steps ✓). c_change drifted to −0.756 (monotone across breakthroughs — suspected BFGS overshoot). Best LL = −7948.54 at iter 26. Killed to avoid RAM OOM when profile scan launched simultaneously.
2026-05-13
Profile LL scan — c_change identified at −0.615
Fixed all 9 other K=10 params at the v7 iter 26 checkpoint and swept c_change on a 16-point grid over [−2.5, −0.05] across all 29 groups (569 persons, ~4.5 hours). Clear interior maximum at c_change = −0.615 (LL = −7946.94). The v7 checkpoint at −0.756 was only 1.6 LL units sub-optimal, so BFGS overshot only slightly. The HH MNL prior (−0.301) is firmly rejected (−99.9 LL units from peak). c_change is identified. See profile_ll_cchange_v7.html.
2026-05-13
v8 warm-start launched — killed at iter 1 (host RAM pressure)
Warm-start from v7 iter 26 checkpoint (c_change = −0.756). Completed iter 1 (LL = −7948.54, RAM = 13.3 GB — disk mode working correctly). Killed by host OOM: labuser's ML job (7.8 GB) + KDE desktop session exhausted shared host RAM. Log: estimation_results/nfxp_v8_workersonly_20260513.log.
2026-05-13
v9 warm-start — relaunching from v7 iter 26 checkpoint
Relaunch with identical settings as v8 once host RAM headroom confirmed. Disk mode confirmed working (13.3 GB peak per iter). No code changes needed. Warm-start: estimation_results/nfxp_checkpoint_20260513_142216.csv.

2 Root Cause: The Zone Bug

How graph groups are built

Persons are grouped by activity type + timing window (28 groups with --scheduling-preferences --timing-round 60). Each group shares one state-action graph built by the forward pass. Because graphs are expensive to build (~150s each), they are built once and reused across all NFXP iterations.

The bug: Person 0's zone used for the whole group

What the graph had (buggy)

  • WORK @ CZONE_106 ✓
  • WORK @ CZONE_20 ✗
  • WORK @ CZONE_64 ✗
  • WORK @ CZONE_59 ✗
  • WORK @ … (141 more) ✗
  • 7,921 WORK states total, all in Person 0's zone

What the graph needs (fixed)

  • WORK @ CZONE_106 ✓
  • WORK @ CZONE_20 ✓
  • WORK @ CZONE_64 ✓
  • WORK @ CZONE_59 ✓
  • WORK @ all 144 zones ✓
  • ~1.15M WORK states total, across all 144 zones

Why it was silent: The observed-path matching step (encode_observed_steps) looks up each step in the graph by hash. When a WORK state at the wrong zone is not found, it returns None and skips that step — no error, no warning. The LL computation simply excludes those steps from Uobs.
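
The silent-drop behaviour described above can be illustrated with a minimal sketch. This is not the project's encode_observed_steps; the state key layout (activity, zone, time step), the function names, and the toy graph are all assumptions made for illustration. The point is only that returning the unmatched steps alongside the matched ones turns a silent skip into something a match-rate check can catch.

```python
from typing import Dict, List, Tuple

# Hypothetical state key: (activity, zone, time_step).
State = Tuple[str, int, int]


def encode_steps_with_report(
    observed: List[State],
    state_index: Dict[State, int],
) -> Tuple[List[int], List[State]]:
    """Look up each observed step; return matched ids AND the unmatched steps."""
    matched: List[int] = []
    dropped: List[State] = []
    for step in observed:
        idx = state_index.get(step)
        if idx is None:
            # A silent implementation would just `continue` here.
            dropped.append(step)
        else:
            matched.append(idx)
    return matched, dropped


def match_rate(observed: List[State], state_index: Dict[State, int]) -> float:
    """Share of observed steps that exist in the graph."""
    matched, _ = encode_steps_with_report(observed, state_index)
    return len(matched) / len(observed) if observed else 1.0


# Toy graph that only contains WORK at zone 106 (the Person 0 situation).
graph = {("HOME", 20, 0): 0, ("WORK", 106, 1): 1, ("HOME", 20, 2): 2}
# A person who actually works at zone 64.
path = [("HOME", 20, 0), ("WORK", 64, 1), ("HOME", 20, 2)]

ids, missing = encode_steps_with_report(path, graph)
```

In this toy case the WORK step lands in `missing`, and a per-group assertion on `match_rate` would have flagged the problem before estimation started.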

Impact on the likelihood

With bug (v1–v5)
LL = Uobs(HOME + SHOP + LEIS) − V(s0; δ, …)

Uobs has no WORK terms, so δ appears only in V(s0), the subtracted normalizing term. LL therefore always decreases as δ increases, and the optimizer pushes δ to the smallest value consistent with the α/β constraints. Result: δ = 0.0408 (near-zero, unstable).

With fix (v6 onward)
LL = Uobs(HOME + WORK + SHOP + LEIS) − V(s0; δ, …)

Uobs now includes WORK step utilities → δ shifts both Uobs and V(s0). The LL surface has a proper interior maximum in δ — genuine identification from observed work episodes.
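
The sign argument can be written out explicitly. The following is a sketch under two assumptions that are not stated in the estimation code: V(s0) is the log-sum over feasible paths with unit scale, and WORK utility is linear in work duration (δ times time worked). The symbols for the path set, the per-path non-work utility u_p, and the work durations T are introduced here for illustration only.

```latex
\begin{aligned}
V(s_0;\delta) &= \log \sum_{p \in \mathcal{P}} \exp\bigl(u_p + \delta\, T^{\text{work}}_p\bigr), \\
\frac{\partial LL}{\partial \delta}
  &= \frac{\partial U_{\text{obs}}}{\partial \delta} - \frac{\partial V(s_0)}{\partial \delta}
   = T^{\text{work}}_{\text{obs}} - \mathbb{E}_{\theta}\bigl[T^{\text{work}}\bigr].
\end{aligned}
```

Under these assumptions, the buggy case has no matched WORK steps, so the first term is zero and the derivative is non-positive for every δ: the LL can only fall as δ rises. With WORK steps matched, the first-order condition equates observed and model-expected work time, which is what produces an interior maximum.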

3 Why Earlier Runs Still "Converged"

Despite the bug, L-BFGS-B found a stationary point every time. This is expected — the LL is still smooth and bounded even with WORK steps missing. The optimizer converged to a genuine stationary point of the wrong objective.
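
A toy calculation shows how an optimizer can converge cleanly on the wrong objective. This is a three-path illustration, not the DDCM: the path utilities, work hours, and observation counts are made-up numbers, and WORK utility is assumed to be δ times hours. Dropping the WORK term from the observed utility makes the LL strictly decreasing in δ, while restoring it yields an interior maximum.

```python
import math

# Hypothetical toy model: three candidate day-paths.
T = [0.0, 4.0, 8.0]   # work hours on each path
u = [1.0, 0.0, 0.0]   # non-work utility of each path
n_obs = [2, 3, 5]     # number of observed persons choosing each path


def value(delta: float) -> float:
    """V(s0): log-sum over all paths, with WORK utility = delta * hours."""
    z = [ui + delta * ti for ui, ti in zip(u, T)]
    m = max(z)
    return m + math.log(sum(math.exp(zi - m) for zi in z))


def loglik(delta: float, work_visible: bool) -> float:
    """Sample LL; work_visible=False mimics WORK steps dropped from Uobs."""
    v = value(delta)
    total = 0.0
    for ui, ti, n in zip(u, T, n_obs):
        u_obs = ui + (delta * ti if work_visible else 0.0)
        total += n * (u_obs - v)
    return total


grid = [i / 100 for i in range(101)]          # delta from 0.00 to 1.00
ll_bug = [loglik(d, False) for d in grid]
ll_fix = [loglik(d, True) for d in grid]

best_bug = grid[ll_bug.index(max(ll_bug))]    # pinned at the lower bound
best_fix = grid[ll_fix.index(max(ll_fix))]    # interior point of the grid
```

Both curves are smooth, so a gradient-based optimizer reports convergence in either case; only the second one carries information about δ from observed work episodes.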

What the optimizer was actually fitting

  • 702 workers treated as de facto non-workers (no WORK in Uobs)
  • 723 non-workers correctly fitted (HOME + SHOP + LEIS only)
  • δ identified only indirectly via V(s0) interaction with α, β
  • βshop, βleis, c_change, μhome estimated from non-work activity patterns

What the converged values mean

  • δ = 0.0408 — artifact; not identified from WORK observations
  • μhome = 0.0617 > δ — consequence of δ being pushed low
  • c_change = −0.301 — already fixed separately by forbidden_mask in BI; not affected by zone bug
  • βshop, βleis — may be more reliable (non-workers correct)
Note on simulated vs observed comparison: Simulated workers did show WORK activity in previous behavioral validation — but all simulated workers commuted to CZONE_106 (Person 0's zone), not their actual work zones. Activity-type shares looked plausible but zones and travel times were wrong.

4 The Fix (4 Code Changes)

estimation/tensor_sampler.py
Build zone-agnostic mandatory sequence: [(WORK, None)] instead of [(WORK, person_0_zone)]. Generates WORK states at all 144 zones in the forward pass.
model/mandatory_sequence_utils.py
Allow None zone in sequence constraint check: (required_zone is None) OR (action.destination_zone == required_zone). Zone=None means any zone is valid.
planning/gpu_constraint_filter.py
Sentinel −1 for zone-agnostic slots in GPU tensor lookup. zone_ok = (required_zones < 0) | (action_zone == required_zones). Fixed for both tensor (GPU) and numpy (CPU) code paths.
planning/graph_builder_tensor.py
Fix int32 overflow in CSR row_ptr: zone-agnostic graphs produce ~2.23B edges, exceeding INT32_MAX (2.15B). Auto-select int64 when edge count exceeds 2B.
estimation/tensor_sampler.py: the core change
- mandatory_seq = persons_group[0]['mandatory_sequence']
+ _raw_seq = persons_group[0]['mandatory_sequence']
+ mandatory_seq = [(act, None) for act, zone in _raw_seq] if _raw_seq else _raw_seq
# Zone=None → WORK states generated at all zones, not just Person 0's zone
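
The zone checks and the index-width choice listed above can be sketched as follows. This is an illustration, not the contents of the four files: the function names and the ANY_ZONE constant are hypothetical, and the dtype rule uses the exact int32 limit, whereas the description above states a 2B threshold.

```python
import numpy as np

ANY_ZONE = -1  # sentinel standing in for zone=None inside integer arrays


def zone_ok_scalar(required_zone, destination_zone) -> bool:
    """Scalar form: None means any destination zone satisfies the slot."""
    return required_zone is None or destination_zone == required_zone


def zone_ok_array(required_zones, action_zones) -> np.ndarray:
    """Vectorised form using the -1 sentinel for zone-agnostic slots."""
    required_zones = np.asarray(required_zones)
    action_zones = np.asarray(action_zones)
    return (required_zones < 0) | (action_zones == required_zones)


def row_ptr_dtype(n_edges: int):
    """Pick the CSR row_ptr dtype: int32 only if every offset fits."""
    return np.int32 if n_edges <= np.iinfo(np.int32).max else np.int64


# Zone-agnostic rewrite of a mandatory sequence, as in the diff above.
raw_seq = [("WORK", 106)]
agnostic_seq = [(act, None) for act, _zone in raw_seq]
# Encoding for array lookups: None becomes the sentinel.
required = [ANY_ZONE if z is None else z for _act, z in agnostic_seq]
```

With the edge counts reported in this section, `row_ptr_dtype(1_810_000_000)` stays at int32 while `row_ptr_dtype(2_230_000_000)` moves to int64, which matches the overflow described for the zone-agnostic graph.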

Test results (15 workers, 3 unique work zones)

Check | Before fix | After fix | Result
WORK states in graph | 7,921 states (1 zone) | 1,148,678 states (144 zones) | PASS
Observed step match rate | ~2.5% (WORK steps missing) | 98.6% | PASS
Workers with WORK matched | 1 / 99 (Person 0 only) | 3 / 3 (all in test group) | PASS
V(s0) finite | Finite but wrong (wrong zones) | Finite for all home zones | PASS
Graph size | 5.76M states / 1.81B edges | 7.07M states / 2.23B edges | ~23% larger

5 Current Estimation Status

v9 running — iter 4 complete · best LL = −7590.01 (+358 units over v7) · δ=0.043 · c_change=−0.770 · ‖g‖∞=0.580 (converging) · RAM 15.8 GB · launched 2026-05-13
v9 iter 4 is the current best estimate. LL improved by 358 units over v7 (−7590 vs −7948) in just 4 BFGS iterations. δ = 0.043 (genuine, identified from WORK steps ✓). c_change = −0.770 (moving toward profile MLE at −0.615). Gradient norm shrinking (1.0 → 0.58) — convergence in progress.
Run | Persons | Outcome | Zone bug | δ result | c_change
v4 (Apr 2026) | 1,527 | Crashed (OOM) | Yes | N/A | N/A
v5 ★ first convergence (zone bug active) | ~600 (reused v4 graphs) | Converged (18 iter); Hessian crashed post-opt | Yes: WORK steps invisible, Uobs δ-invariant | 0.0408 (artifact) | −0.301 (fixed earlier)
v6 (May 2026) | 702 workers | Crashed (CUDA OOM); ~1 iter, group 025 | Fixed | N/A (crashed) | N/A
v7 ★ first genuine run (zone bug fixed) | 702 workers · 569 valid LL | 26 iterations; killed (RAM, profile scan) | Fixed (zone-agnostic) | 0.047 (genuine ✓) | −0.756 (near MLE)
v8 (warm-start from v7 iter 26) | 702 workers · 569 valid LL | Killed at iter 1 (host RAM pressure) | Fixed | N/A (killed) | N/A
v9 ★ current best (warm-start from v7 iter 26) | 702 workers · 569 valid LL | 4 iters · LL=−7590; converging, ‖g‖∞=0.58 | Fixed | 0.043 (iter 4) | −0.770 (iter 4), moving → −0.615
v7/v8/v9 settings: 2000 persons → 702 workers (734 non-workers removed, 48%) · 29 zone-agnostic groups (home-zone bucketing) · zone-agnostic mandatory sequence · analytical gradient · BFGS · warm start from v5 (v7) / v7 iter 26 (v8, v9).

6 Detailed Investigation Reports