DDCM Estimation — Root Cause Investigation

Why did δ converge to near-zero? Root cause: zone-matching bug made WORK utility invisible in U_obs. Workers-only sample also applied as correct methodology. (c_change was a separate earlier issue — already resolved.)

Higashihiroshima DDCM · K=10 NFXP · Investigation period: 2026-04-20 → 2026-05-13

✓ Both bugs fixed · v7 stopped at iter 26 · v9 running, iter 4 best LL=−7590.01 (Δ+358 from v7)

0 Executive Summary (Meeting Brief)

Symptom

δ → 0.0408

Work utility near-zero at MLE; μ_home > δ → agents prefer home over work

Workers affected

701 / 702

Only 1 worker had WORK steps matched in v5 estimation

U_obs WORK contribution

WORK steps silently dropped from all observed paths

After fix (test group)

98.6%

Step match rate; WORK states at all 144 zones

Root cause 1

Zone bug in graph builder

Graph built with Person 0's zone only; WORK steps of all other workers dropped

Secondary issue

Non-workers excluded (methodology)

9 non-workers in v5 had minor impact; zone bug alone was sufficient. Workers-only is correct by design regardless.

Re-estimation

v9 iter 4 ✓

Best LL=−7590.01 · δ=0.043 · c_change=−0.770 · gradient ‖g‖∞=0.580 · converging

Note on c_change: The earlier c_change → −∞ problem was a separate issue, already fixed before this investigation by adding forbidden_mask to BackwardInduction.run() (blocks non-home HOME states). After that fix, c_change converged to −0.301. This investigation is specifically about why δ remained near-zero even after the c_change fix.

Bottom line: The only prior run that converged is v5, and its δ = 0.0408 is not a valid estimate. The primary cause is the zone bug: WORK states existed only for Person 0's zone, which made U_obs completely δ-invariant for all other workers. Non-worker exclusion (--workers-only) is correct methodology but was a minor secondary factor in v5 (9 persons out of ~600). Fix: zone-agnostic graph construction.

1 Investigation Timeline

2026-04-20 (earlier)

✓ Resolved: c_change → −∞ fixed by forbidden_mask in BI

Root cause: BackwardInduction.run() was never passed the forbidden_mask argument in any pipeline call. Without it, HOME activity was legal at every zone — agents could do aimless cross-zone HOME loops at zero cost, producing ~25 simulated trips/day vs ~3 observed. c_change was the only lever to suppress these HOME loops → drifted to −∞.
Fix: Pass forbidden_mask (blocks non-home HOME states with V=−∞) to BI in 4 pipeline locations. After fix, profile LL has a clean interior argmax and c_change converged to −0.301 (close to HH MNL prior of −0.3). ✓ Resolved.
This is a separate, earlier issue — NOT the subject of this investigation.
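The masking idea can be illustrated with a toy backward induction. This is a minimal sketch, not the project's BackwardInduction.run(): the function names, the two-state graph, and the zero utilities are all hypothetical, and only the rule "forbidden states get V = −∞" comes from the fix described above.

```python
import math

NEG_INF = float("-inf")

def logsumexp(values):
    """Stable log-sum-exp that treats -inf entries as impossible options."""
    finite = [v for v in values if v != NEG_INF]
    if not finite:
        return NEG_INF
    m = max(finite)
    return m + math.log(sum(math.exp(v - m) for v in finite))

def backward_induction(successors, utility, n_steps, forbidden=frozenset()):
    """Toy finite-horizon BI (hypothetical, not the project's implementation).

    successors[s] lists the next states reachable from s,
    utility[(s, s2)] is the flow utility of moving s -> s2,
    and every state in `forbidden` is pinned to V = -inf.
    """
    value = {s: (NEG_INF if s in forbidden else 0.0) for s in successors}
    for _ in range(n_steps):
        value = {
            s: NEG_INF if s in forbidden
            else logsumexp([utility[(s, s2)] + value[s2] for s2 in successors[s]])
            for s in successors
        }
    return value

# Hypothetical person living in zone A: HOME@B is a non-home HOME state.
successors = {"HOME@A": ["HOME@A", "HOME@B"], "HOME@B": ["HOME@A", "HOME@B"]}
utility = {(s, s2): 0.0 for s in successors for s2 in successors[s]}

v_unmasked = backward_induction(successors, utility, n_steps=3)
v_masked = backward_induction(successors, utility, n_steps=3, forbidden={"HOME@B"})
```

Without the mask, the free HOME@B detour inflates the value of every state; with the mask, that option carries no weight in the log-sum.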

2026-04-20

New anomaly: δ still near-zero (0.0408) after c_change fix

After the forbidden_mask fix, c_change converged to −0.301 (sensible). But δ remained at 0.0408 — WORK marginal utility near-zero, meaning μ_home > δ and agents prefer staying home over going to work. This is the bug this investigation addresses.

2026-04-22 – 04-30

Tier B1 tests and profile LL scan — δ investigation begins

Three choice-set constraint tests all failed to fix δ identification. A profile LL scan on group 008 (44 workers) shows that U_obs is completely invariant to δ: changing δ from 0.01 to 0.14 leaves U_obs unchanged. See profile_ll_delta_muhome.html.

2026-05-12

Root cause found: zone-matching bug in graph construction

Inspection of group 008's saved graph metadata: 7,921 WORK states, all in CZONE_106 (Person 0's zone). 98 of 99 workers work at different zones → their WORK steps never match the graph → silently dropped from U_obs. Bug present in all 27 worker groups across all prior estimation runs.

2026-05-12

Fix designed and implemented (zone-agnostic mandatory sequence)

Build the shared group graph with mandatory_sequence = [(WORK, None)] so WORK states are generated at all zones. Zone-specific constraint replaced by zone-agnostic (any zone is valid). WORK utility in K=10 is purely time-based — zone enters only through travel cost, already priced in TRAVEL edges. Semantically sound.

2026-05-12

Zone fix tested and validated

Test script on 15 workers (3 unique work zones): 1,148,678 WORK states across 144 zones (PASS), 98.6% step match rate (PASS), V(s₀) finite for all home zones (PASS).

2026-05-12

Methodology fix: exclude non-workers from estimation sample

Non-workers' V(s₀) includes hypothetical WORK paths (δ affects it), but their U_obs has no WORK steps → LL_non-worker is monotone decreasing in δ. In v5, only 9 non-workers were present out of ~600 persons — minor numerical impact, with the zone bug being the dominant suppressor. Still, excluding non-workers is the correct approach by study design (matching Vastberg 2020): we are estimating work-activity preferences, so the sample should only include persons who actually work.

2026-05-12

Fix 2: workers-only sample · v6 launched

Added --workers-only flag to drop non-workers before grouping. v6 launched: 2000 persons → 793 workers (734 non-workers removed, 48%), 109 timing-window groups (min-group-size=5), zone-agnostic graphs, analytical gradient, warm start from v5 checkpoint.

2026-05-12

v6 crashed — transient CUDA error at group 025

OOM / CUDA device-side assert at group_025 Zone 18 stats loop (.item() sync point) after only ~1 iteration. Confirmed transient (v7 passed the same group cleanly). Root cause: RAM pressure from 109 timing-window groups (v5 grouping reused), each needing ~22 GB graph. Relaunched as v7 with --skip-phase1 and consolidated to 29 zone-agnostic groups (home-zone grouping).

2026-05-12 → 2026-05-13

v7 launched — first genuine estimation run (26 iterations)

Re-launched with --skip-phase1 (29 graphs pre-built on disk), --workers-only, --analytical-gradient, warm-start from v5 checkpoint. 702 persons, 569 valid LL contributors. BFGS ran 26 iterations — δ rose to 0.047 (genuine signal from WORK steps ✓). c_change drifted to −0.756 (monotone across breakthroughs — suspected BFGS overshoot). Best LL = −7948.54 at iter 26. Killed to avoid RAM OOM when profile scan launched simultaneously.

2026-05-13

Profile LL scan — c_change identified at −0.615

Fixed all 9 other K=10 params at the v7 iter 26 checkpoint; swept c_change on a 16-point grid [−2.5, −0.05] over all 29 groups (569 persons, ~4.5 hours). Clear interior maximum at c_change = −0.615 (LL = −7946.94). The v7 checkpoint at −0.756 was only 1.6 LL units sub-optimal, so BFGS overshot slightly, consistent with the overshoot suspected during v7. HH MNL prior (−0.301) firmly rejected (−99.9 LL units from peak). c_change is identified. See profile_ll_cchange_v7.html.
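The scan procedure (hold the other parameters fixed, sweep one, check that the argmax is interior) can be sketched as follows. Everything here is illustrative: toy_ll is a stand-in concave function, not the NFXP likelihood, and the evenly spaced grid is an assumption, since the report gives only the grid's endpoints and size.

```python
def profile_scan(ll_fn, grid):
    """Evaluate ll_fn at each grid point; return (index of best point, all values)."""
    values = [ll_fn(c) for c in grid]
    best = max(range(len(grid)), key=values.__getitem__)
    return best, values

def is_interior(best, grid):
    """True when the maximum is not at either end of the grid."""
    return 0 < best < len(grid) - 1

# 16 points on [-2.5, -0.05]; even spacing is an assumption.
n_points, lo, hi = 16, -2.5, -0.05
grid = [lo + i * (hi - lo) / (n_points - 1) for i in range(n_points)]

def toy_ll(c_change):
    """Hypothetical stand-in for the profile LL with the other params held fixed."""
    return -(c_change + 0.6) ** 2

best, values = profile_scan(toy_ll, grid)
gap_at_edge = values[best] - values[-1]  # LL lost by sitting at the upper grid edge
```

An argmax at a grid edge would mean the parameter is not pinned down within the scanned range; an interior argmax with a visible LL drop on both sides is the pattern reported for c_change.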

2026-05-13

v8 warm-start launched — killed at iter 1 (host RAM pressure)

Warm-start from v7 iter 26 checkpoint (c_change = −0.756). Completed iter 1 (LL = −7948.54, RAM = 13.3 GB — disk mode working correctly). Killed by host OOM: labuser's ML job (7.8 GB) + KDE desktop session exhausted shared host RAM. Log: estimation_results/nfxp_v8_workersonly_20260513.log.

2026-05-13

v9 warm-start launched from v7 iter 26 checkpoint

Relaunched with the same settings as v8 once host RAM headroom was confirmed. Disk mode confirmed working (13.3 GB peak per iter in v8). No code changes needed. Warm-start: estimation_results/nfxp_checkpoint_20260513_142216.csv.

2 Root Cause: The Zone Bug

How graph groups are built

Persons are grouped by activity type + timing window (28 groups with --scheduling-preferences --timing-round 60). Each group shares one state-action graph built by the forward pass. Because graphs are expensive to build (~150s each), they are built once and reused across all NFXP iterations.

The bug: Person 0's zone used for the whole group

WORK state	What the graph had (buggy)	What the graph needs (fixed)
WORK @ CZONE_106 (Person 0's zone)	✓	✓
WORK @ CZONE_20	✗	✓
WORK @ CZONE_64	✗	✓
WORK @ CZONE_59	✗	✓
WORK @ all other zones (140 more)	✗	✓
Total WORK states	7,921, all in Person 0's zone	~1.15M across all 144 zones

Why it was silent: The observed-path matching step (encode_observed_steps) looks up each step in the graph by hash. When a WORK state at the wrong zone is not found, it returns None and skips that step — no error, no warning. The LL computation simply excludes those steps from U_obs.
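A guard of the following kind would have surfaced the problem on the first run. This is a hedged sketch: encode_steps_checked, its arguments, and the 0.95 threshold are hypothetical, and the real encode_observed_steps signature is not shown in this report.

```python
import warnings

def encode_steps_checked(steps, state_index, min_match_rate=0.95):
    """Map observed steps to graph state ids and report misses instead of hiding them.

    Hypothetical helper: `state_index` maps a hashable step key to a state id.
    """
    matched, missing = [], []
    for step in steps:
        state_id = state_index.get(step)
        if state_id is None:
            missing.append(step)      # keep the miss so it can be inspected
        else:
            matched.append(state_id)
    rate = len(matched) / len(steps) if steps else 1.0
    if missing and rate < min_match_rate:
        warnings.warn(f"only {rate:.1%} of observed steps matched; first miss: {missing[0]}")
    return matched, missing, rate

# Graph built with Person 0's work zone only (CZONE_106), as in the buggy runs.
state_index = {("HOME", "CZONE_20"): 0, ("WORK", "CZONE_106"): 1}
observed = [("HOME", "CZONE_20"), ("WORK", "CZONE_20"), ("HOME", "CZONE_20")]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    matched, missing, rate = encode_steps_checked(observed, state_index)
```

Here the WORK step at CZONE_20 is returned in `missing` and triggers a warning, rather than vanishing from U_obs.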

Impact on the likelihood

With bug (v1–v5)

LL = U_obs(HOME + SHOP + LEIS) − V(s₀; δ, …)

U_obs has no WORK terms, so δ appears only in V(s₀), the log-sum normalizing term. V(s₀) is increasing in δ, so LL always decreases as δ increases and the optimizer pushes δ to the smallest value consistent with the α/β constraints. Result: δ = 0.0408 (near-zero, unstable).

With fix (v6 onward)

LL = U_obs(HOME + WORK + SHOP + LEIS) − V(s₀; δ, …)

U_obs now includes WORK step utilities, so δ shifts both U_obs and V(s₀). The LL surface has a proper interior maximum in δ: genuine identification from observed work episodes.
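The two likelihood shapes can be reproduced in a toy model. This is a sketch under strong simplifying assumptions, not the K=10 specification: a day is split into w WORK hours and HOURS − w HOME hours, utility is linear in hours, and HOURS, W_OBS, and MU_HOME are made-up values.

```python
import math

HOURS = 12       # hypothetical daily time budget
W_OBS = 8        # hypothetical observed WORK hours
MU_HOME = 0.06   # HOME marginal utility, held fixed in this toy

def value_s0(delta):
    """V(s0): log-sum over every feasible schedule (w WORK hours, the rest HOME)."""
    terms = [delta * w + MU_HOME * (HOURS - w) for w in range(HOURS + 1)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def ll_buggy(delta):
    """WORK steps dropped from U_obs: delta enters only through V(s0)."""
    return MU_HOME * (HOURS - W_OBS) - value_s0(delta)

def ll_fixed(delta):
    """WORK steps present in U_obs: delta enters both terms."""
    return delta * W_OBS + MU_HOME * (HOURS - W_OBS) - value_s0(delta)

grid = [i / 100 for i in range(51)]          # delta from 0.00 to 0.50
buggy = [ll_buggy(d) for d in grid]
fixed = [ll_fixed(d) for d in grid]
best_buggy = max(range(len(grid)), key=buggy.__getitem__)
best_fixed = max(range(len(grid)), key=fixed.__getitem__)
```

In this toy, the buggy LL is maximized at the smallest δ on the grid, while the fixed LL peaks at an interior δ, where expected WORK hours under the model equal the observed hours.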

3 Why Earlier Runs Still "Converged"

Despite the bug, L-BFGS-B found a stationary point every time. This is expected — the LL is still smooth and bounded even with WORK steps missing. The optimizer converged to a genuine stationary point of the wrong objective.

What the optimizer was actually fitting

702 workers treated as de facto non-workers (no WORK in U_obs)
723 non-workers correctly fitted (HOME + SHOP + LEIS only)
δ identified only indirectly via V(s₀) interaction with α, β
β_shop, β_leis, c_change, μ_home estimated from non-work activity patterns

What the converged values mean

δ = 0.0408 — artifact; not identified from WORK observations
μ_home = 0.0617 > δ — consequence of δ being pushed low
c_change = −0.301 — already fixed separately by forbidden_mask in BI; not affected by zone bug
β_shop, β_leis — may be more reliable (non-workers correct)

Note on simulated vs observed comparison: Simulated workers did show WORK activity in previous behavioral validation — but all simulated workers commuted to CZONE_106 (Person 0's zone), not their actual work zones. Activity-type shares looked plausible but zones and travel times were wrong.

4 The Fix (4 Code Changes)

estimation/tensor_sampler.py

Build zone-agnostic mandatory sequence: [(WORK, None)] instead of [(WORK, person_0_zone)]. Generates WORK states at all 144 zones in the forward pass.

model/mandatory_sequence_utils.py

Allow None zone in sequence constraint check: (required_zone is None) OR (action.destination_zone == required_zone). Zone=None means any zone is valid.

planning/gpu_constraint_filter.py

Sentinel −1 for zone-agnostic slots in GPU tensor lookup. zone_ok = (required_zones < 0) | (action_zone == required_zones). Fixed for both tensor (GPU) and numpy (CPU) code paths.

planning/graph_builder_tensor.py

Fix int32 overflow in CSR row_ptr: zone-agnostic graphs produce ~2.23B edges, exceeding INT32_MAX (2.15B). Auto-select int64 when edge count exceeds 2B.
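The index-width issue can be shown in plain Python. The helper names below are hypothetical and the edge counts are rounded illustrations of the figures in this report; the 2B threshold is the one stated above.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647

def wrap_int32(n):
    """Value a signed 32-bit integer would hold after storing n (two's complement)."""
    return (n + 2**31) % 2**32 - 2**31

def csr_index_dtype(n_edges, threshold=2_000_000_000):
    """Pick the row_ptr dtype: int64 once the edge count exceeds the threshold."""
    return "int64" if n_edges > threshold else "int32"

def build_row_ptr(out_degrees):
    """Cumulative edge offsets per state, plus the dtype needed to store them."""
    row_ptr = [0]
    for degree in out_degrees:
        row_ptr.append(row_ptr[-1] + degree)
    return row_ptr, csr_index_dtype(row_ptr[-1])

# Illustrative degrees summing to ~2.23B edges (the zone-agnostic graph size).
row_ptr, dtype = build_row_ptr([1_000_000_000, 1_000_000_000, 230_000_000])
wrapped = wrap_int32(row_ptr[-1])  # what an int32 row_ptr would have stored
```

With int32 storage the final offset would wrap to a negative number, which is why the dtype has to be chosen before the array is filled. Setting the threshold slightly below INT32_MAX leaves some headroom.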

estimation/tensor_sampler.py — the one-line fix

# Zone=None → WORK states generated at all zones, not just Person 0's zone
- mandatory_seq = persons_group[0]['mandatory_sequence']
+ _raw_seq = persons_group[0]['mandatory_sequence']
+ mandatory_seq = [(act, None) for act, zone in _raw_seq] if _raw_seq else _raw_seq
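The None convention and the −1 sentinel should accept exactly the same actions. A minimal sketch of that equivalence follows, using scalar checks; the function names and ANY_ZONE are hypothetical, and the project's GPU path applies the same predicate element-wise on tensors.

```python
ANY_ZONE = -1  # sentinel for a zone-agnostic slot (hypothetical constant name)

def to_zone_agnostic(raw_seq):
    """Drop the zone from each (activity, zone) pair, as in the tensor_sampler change."""
    return [(act, None) for act, _zone in raw_seq] if raw_seq else raw_seq

def zone_ok(required_zone, destination_zone):
    """Python-side check: None means any destination zone satisfies the slot."""
    return required_zone is None or destination_zone == required_zone

def encode_required_zone(required_zone):
    """Integer encoding for tensor lookups: None becomes the -1 sentinel."""
    return ANY_ZONE if required_zone is None else required_zone

def zone_ok_encoded(required_code, destination_zone):
    """Scalar form of zone_ok = (required_zones < 0) | (action_zone == required_zones)."""
    return required_code < 0 or destination_zone == required_code

agnostic_seq = to_zone_agnostic([("WORK", 106)])
```

This relies on real zone ids being non-negative, so that −1 can never collide with an actual zone.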

Test results (15 workers, 3 unique work zones)

Check	Before fix	After fix	Result
WORK states in graph	7,921 states (1 zone)	1,148,678 states (144 zones)	PASS
Observed step match rate	~2.5% (WORK steps missing)	98.6%	PASS
Workers with WORK matched	1 / 99 (Person 0 only)	3 / 3 (all in test group)	PASS
V(s₀) finite	Finite but wrong (wrong zones)	Finite for all home zones	PASS
Graph size	5.76M states / 1.81B edges	7.07M states / 2.23B edges	~23% larger

5 Current Estimation Status

v9 running — iter 4 complete · best LL = −7590.01 (+358 units over v7) · δ=0.043 · c_change=−0.770 · ‖g‖∞=0.580 (converging) · RAM 15.8 GB · launched 2026-05-13

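v9 iter 4 is the current best estimate. LL improved by 358 units over v7 (−7590 vs −7948) in just 4 BFGS iterations. δ = 0.043 (genuine, identified from WORK steps ✓). c_change = −0.770, slightly further from the v7 profile maximum (−0.615) than the v7 checkpoint (−0.756); that profile held the other 9 parameters at their v7 values, so the joint optimum may sit elsewhere. Gradient norm shrinking (1.0 → 0.58), so convergence is in progress.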

Run	Persons	Outcome	Zone bug	δ result	c_change
v4 (Apr 2026)	1,527	Crashed (OOM)	Yes	N/A	N/A
v5 ★ first convergence, zone bug active	~600 (reused v4 graphs)	Converged (18 iter); Hessian crashed post-opt	Yes: WORK steps invisible, U_obs δ-invariant	0.0408 (artifact)	−0.301 (fixed earlier)
v6 (May 2026)	702 workers	Crashed (CUDA OOM) after ~1 iter, group 025	Fixed	N/A (crashed)	N/A
v7 ★ first genuine run, zone bug fixed	702 workers · 569 valid LL	26 iterations; killed (RAM, profile scan)	Fixed (zone-agnostic)	0.047 (genuine ✓)	−0.756 (near profile maximum)
v8, warm-start from v7 iter 26	702 workers · 569 valid LL	Killed at iter 1 (host RAM pressure)	Fixed	N/A (killed)	N/A
v9 ★ current best, warm-start from v7 iter 26	702 workers · 569 valid LL	4 iters · LL=−7590; converging, ‖g‖∞=0.58	Fixed	0.043 (iter 4)	−0.770 (iter 4); v7 profile maximum at −0.615

v7/v8/v9 settings: 2000 persons → 702 workers (734 non-workers removed, 48%) · 29 zone-agnostic groups (home-zone bucketing) · zone-agnostic mandatory sequence · analytical gradient · BFGS · warm start from v5 (v7) / v7 iter 26 (v8, v9).

6 Detailed Investigation Reports

Profile LL Analysis

2D Profile LL: δ × μ_home

Scans the LL surface over (δ, μ_home) for group 008. Confirms U_obs is completely invariant to δ. Interactive heatmap, contour, and 1D slices.

⚠ Written before root cause found. Attributed δ-invariance to "mandatory sequence outside CSR graph" — the real cause is the zone bug (§2 above). The symptom description is correct; the mechanism explanation is outdated.

Behavioral Analysis

Work Timing: Why Agents Delay Work

Analysis of simulated vs. observed work start times using the (buggy) v4/v5 estimates. Shows μ_home > δ as the mechanism for late work departure. Context for understanding what the corrected estimation should change.

⚠ Based on v4/v5 estimates with zone bug. Work timing results will change once a zone-fixed run (currently v9) converges.

v7 Profile LL · Latest

Profile LL: c_change Sweep (v7, Zone-Agnostic)

Sweeps c_change over 16 grid points [−2.5, −0.05] with all other K=10 params fixed at v7 iter 26. Clear interior maximum at −0.615, so c_change is identified. v7 checkpoint (−0.756) only 1.6 LL units sub-optimal. HH MNL prior (−0.301) firmly rejected (−99.9 LL units). Confirms that warm-starting from the v7 checkpoint (v8, v9) is on the right track.