Relational Markov Decision Processes: Theory and Implementation for DDCM

Abstract: This report presents a comprehensive overview of Relational Markov Decision Processes (RMDPs) and their practical application to the Dynamic Discrete Choice Model (DDCM) for activity-based travel demand modeling. We bridge the theoretical foundations from Boutilier et al.’s seminal work on Symbolic Dynamic Programming with a practical GPU-accelerated implementation that achieves ~20,000× speedup for population-scale simulations.

1. Introduction: The Curse of Dimensionality in Standard MDPs

In a standard Markov Decision Process (MDP), you must explicitly enumerate every possible state and action. For travel demand modeling, this creates a fundamental scaling problem:

Example: For a logistics or travel problem with 144 zones, if each agent has a unique (home, work) configuration:

Unique graphs needed: 144 × 144 = 20,736 graphs
Graph build time: ~55 seconds each
Total time: 20,736 × 55s ≈ 316 hours (13 days)

This exponential explosion—the “curse of dimensionality”—makes standard MDPs impractical for real-world applications involving many locations, agents, or time periods.

2. Theoretical Foundations of RMDPs

2.1 From Propositional to First-Order Logic

RMDPs solve the dimensionality problem by borrowing ideas from First-Order Logic (FOL). Instead of a flat list of states, RMDPs describe the world using:

Component	Description	Example
Objects & Classes	General categories rather than specific individuals	`Box`, `Truck`, `City`, `Zone`
Relations (Fluents)	Predicates describing state that change over time	`On(box, truck)`, `In(truck, city)`
Action Schemas	Templates that apply to any objects	`drive(truck, city_a, city_b)`

2.2 The Power of Quantifiers

First-Order Logic provides two quantifiers that enable abstraction:

Universal Quantifier (∀): “For all”
- Example: ∀b, ¬On(b, Truck1, s) means “Truck 1 is empty”
Existential Quantifier (∃): “There exists”
- Example: ∃b, In(b, Paris, s) means “At least one box is in Paris”

Key Insight: A single logical formula can describe policy for any number of objects:

\[\exists b, \text{On}(b, \text{Truck1}) \implies \text{Drive to Paris}\]

This applies whether you have 5 boxes or 500 boxes.

3. Symbolic Dynamic Programming (Boutilier et al., 2001)

3.1 Core Problem Addressed

Traditional MDP algorithms become impractical when:

State spaces grow exponentially with domain features
Problems involve relational fluents between objects
Domains contain unbounded or infinite numbers of objects
Quantification is needed (e.g., “there exists some box in Paris”)

3.2 Key Innovation: Situation Calculus

Boutilier’s approach uses situation calculus as the representational language:

Element	Notation	Example
Actions	First-order terms	`putTable(b)`
Situations	Action sequences via `do(action, s)`	`do(stack(A,B), S₀)`
Fluents	Relations varying by situation	`BIn(b, Paris, s)`

3.3 Decomposing Stochastic Actions

A critical trick: decompose stochastic actions into deterministic primitives with probabilistic outcomes:

Stochastic load(b,t) becomes:
  - loadS(b,t) — successful load (probability 0.9)
  - loadF(b,t) — failed load (probability 0.1)

3.4 Case Notation for Value Functions

Value functions are represented without state enumeration using case notation:

V(s) = case {
   ∃b. BIn(b, Rome, s) → 10,    // If any box is in Rome, value = 10
   ¬∃b. BIn(b, Rome, s) → 0     // Otherwise, value = 0
}

This applies regardless of how many boxes exist—a universal description.

3.5 First-Order Decision-Theoretic Regression (FODTR)

The core algorithm computes Q-functions symbolically:

\[Q^V(A(x), s) = R(s) + \gamma \sum_j \text{prob}(n_j(x) | A(x), s) \times V(do(n_j(x), s))\]

Key steps:

Substitute case statements for R(s), prob(·), and V(·)
Apply regression to move from future situation back to current
Combine results using case operators

3.6 Dual Abstraction: States AND Actions

A crucial advantage is action abstraction. A single condition like:

On(b,t,s) ∧ TIn(t, Paris, s) ∧ ¬∃b'. BIn(b', Paris, s)

describes when to execute unload(b,t) for any box b and truck t satisfying these conditions—no enumeration needed.

4. Our RMDP Implementation: Zone Abstraction

4.1 Design Philosophy: Grounded RMDP with Parameterized State Abstraction

Our implementation achieves the practical benefits of RMDPs without the complexity of full symbolic logic operations:

Aspect	Boutilier SDP	Our RMDP Approach
State Representation	First-order logic formulas	Integer tensor (N, 8)
Value Function	Logical case statements	1D numerical tensor
Computation	Sequential symbolic ops	GPU-parallel numerical
Scalability	Infinite domains	Fixed at 144 zones
Implementation	Complex symbolic stack	Standard DP

4.2 Abstract Zone Roles

Instead of encoding concrete zones in the state, we use abstract roles resolved at runtime:

ZoneID	Value	Type	Description
`HOME`	0	Abstract	Agent’s home zone
`WORK`	1	Abstract	Agent’s work zone
`SCHOOL`	2	Abstract	Agent’s school zone
`ZONE_1...144`	3-146	Concrete	Direct mapping to CZONE

Key Insight: The decision structure is identical for all agents—only travel times differ. These are resolved at runtime via agent-specific ZoneBinding.

4.3 ZoneBinding: Runtime Resolution

# Alice's binding
alice_binding = ZoneBinding(
    home=Zone.CZONE_10,
    work=Zone.CZONE_50,
    school=None
)

# Bob's binding
bob_binding = ZoneBinding(
    home=Zone.CZONE_25,
    work=Zone.CZONE_73,
    school=None
)

# Same graph, different concrete zones at runtime
alice_binding.resolve(ZoneID.HOME)  # → Zone.CZONE_10
bob_binding.resolve(ZoneID.HOME)    # → Zone.CZONE_25

4.4 How RMDP Preserves Optimality

Decision Structure: Actions available are identical for all agents
Abstract Roles: Mandatory zones (HOME, WORK, SCHOOL) use abstract roles during planning
Runtime Binding: Actual travel times resolved at simulation time
Shared V-function: All agents use the same value function

5. Implementation Architecture

5.1 New Modules Created

File	Purpose
`core/zone_abstraction.py`	ZoneID enum, ZoneBinding class
`core/state_rmdp.py`	StateRMDP, StateEncoderRMDP
`planning/forward_pass_rmdp.py`	RMDP Forward Pass
`planning/backward_induction_rmdp.py`	RMDP Backward Induction
`planning/simulate_rmdp.py`	RMDP Simulation with binding resolution

5.2 Pipeline Timing (Full Day, 1440 minutes)

Step	Time	Output
RMDP Forward Pass	~8.4s	418,210 states, 21.8M edges
Graph Build (CSR)	~0.2s	Sparse format, 6.7 MB
Backward Induction	~0.2s	81.6% valid states
Total	~8.7s	Works for ALL (home, work) combos

6. Performance Comparison

Approach	Graph Builds	BI Runs	Total Time (144×144)	Speedup
Baseline (No Pruning)	20,736	20,736	71.1h × 20,736 ⚠️	1×
Standard GPU	20,736	20,736	~316 hours	varies
Universal Graph	144	144 × 144	~14 hours	23×
Universal + Basis Function	144	144 × 8	~1.5 hours	210×
RMDP Zone Abstraction	1	1	~50 seconds	29,382×

6.1 Resource Savings

Metric	Before RMDP	After RMDP	Improvement
Graphs needed	20,736	1	20,736×
Build time (worst case)	316 hours	8.7 seconds	~131,000×
Memory	621 GB	30 MB	20,736×

7. Theoretical Connection: Boutilier → Our Implementation

Our Approach	RMDP Concept
Abstract zone roles (HOME, WORK)	First-order predicates
Zone bindings	Instantiating quantified variables
Universal graph	Case-statement partitioning of state space
Parameterized travel times	Symbolic value terms in case expressions

Key Finding: Our approach achieves the practical benefits of RMDP (~20,000× speedup) without the implementation complexity of full symbolic logic operations.

8. Conclusion

RMDP Zone Abstraction represents the most impactful optimization in the DDCM project:

~20,000× speedup for population-scale simulations
One universal graph for all (home, work, school) combinations
Runtime binding resolves abstract zones to concrete zones
Exact solutions maintained (same optimality as per-agent graphs)
Simple implementation compared to full Symbolic Dynamic Programming

This enables simulating entire populations (thousands of agents with diverse home/work zones) in seconds rather than weeks.

References

Boutilier, C., Reiter, R., & Price, B. (2001). Symbolic Dynamic Programming for First-Order MDPs. IJCAI.
Powell, W.B. Reinforcement Learning and Stochastic Optimization.
McCarthy et al. (2025). Activity-duration-dependent utility in a dynamic scheduling model.
Västberg et al. (2020). A Dynamic Discrete Choice Activity-Based Travel Demand Model.

Last Updated: January 7, 2026