Relational Markov Decision Processes: Theory and Implementation for DDCM

Abstract: This report presents a comprehensive overview of Relational Markov Decision Processes (RMDPs) and their practical application to the Dynamic Discrete Choice Model (DDCM) for activity-based travel demand modeling. We bridge the theoretical foundations from Boutilier et al.’s seminal work on Symbolic Dynamic Programming with a practical GPU-accelerated implementation that achieves ~20,000× speedup for population-scale simulations.


1. Introduction: The Curse of Dimensionality in Standard MDPs

In a standard Markov Decision Process (MDP), you must explicitly enumerate every possible state and action. For travel demand modeling, this creates a fundamental scaling problem:

Example: For a logistics or travel problem with 144 zones, if each agent has a unique (home, work) configuration:

  • Unique graphs needed: 144 × 144 = 20,736 graphs
  • Graph build time: ~55 seconds each
  • Total time: 20,736 × 55s ≈ 316 hours (13 days)

This exponential explosion—the “curse of dimensionality”—makes standard MDPs impractical for real-world applications involving many locations, agents, or time periods.


2. Theoretical Foundations of RMDPs

2.1 From Propositional to First-Order Logic

RMDPs solve the dimensionality problem by borrowing ideas from First-Order Logic (FOL). Instead of a flat list of states, RMDPs describe the world using:

Component Description Example
Objects & Classes General categories rather than specific individuals Box, Truck, City, Zone
Relations (Fluents) Predicates describing state that change over time On(box, truck), In(truck, city)
Action Schemas Templates that apply to any objects drive(truck, city_a, city_b)

2.2 The Power of Quantifiers

First-Order Logic provides two quantifiers that enable abstraction:

  • Universal Quantifier (∀): “For all”
    • Example: ∀b, ¬On(b, Truck1, s) means “Truck 1 is empty”
  • Existential Quantifier (∃): “There exists”
    • Example: ∃b, In(b, Paris, s) means “At least one box is in Paris”

Key Insight: A single logical formula can describe policy for any number of objects:

\[\exists b, \text{On}(b, \text{Truck1}) \implies \text{Drive to Paris}\]

This applies whether you have 5 boxes or 500 boxes.


3. Symbolic Dynamic Programming (Boutilier et al., 2001)

3.1 Core Problem Addressed

Traditional MDP algorithms become impractical when:

  • State spaces grow exponentially with domain features
  • Problems involve relational fluents between objects
  • Domains contain unbounded or infinite numbers of objects
  • Quantification is needed (e.g., “there exists some box in Paris”)

3.2 Key Innovation: Situation Calculus

Boutilier’s approach uses situation calculus as the representational language:

Element Notation Example
Actions First-order terms putTable(b)
Situations Action sequences via do(action, s) do(stack(A,B), S₀)
Fluents Relations varying by situation BIn(b, Paris, s)

3.3 Decomposing Stochastic Actions

A critical trick: decompose stochastic actions into deterministic primitives with probabilistic outcomes:

Stochastic load(b,t) becomes:
  - loadS(b,t) — successful load (probability 0.9)
  - loadF(b,t) — failed load (probability 0.1)

3.4 Case Notation for Value Functions

Value functions are represented without state enumeration using case notation:

V(s) = case {
   ∃b. BIn(b, Rome, s) → 10,    // If any box is in Rome, value = 10
   ¬∃b. BIn(b, Rome, s) → 0     // Otherwise, value = 0
}

This applies regardless of how many boxes exist—a universal description.

3.5 First-Order Decision-Theoretic Regression (FODTR)

The core algorithm computes Q-functions symbolically:

\[Q^V(A(x), s) = R(s) + \gamma \sum_j \text{prob}(n_j(x) | A(x), s) \times V(do(n_j(x), s))\]

Key steps:

  1. Substitute case statements for R(s), prob(·), and V(·)
  2. Apply regression to move from future situation back to current
  3. Combine results using case operators

3.6 Dual Abstraction: States AND Actions

A crucial advantage is action abstraction. A single condition like:

On(b,t,s) ∧ TIn(t, Paris, s) ∧ ¬∃b'. BIn(b', Paris, s)

describes when to execute unload(b,t) for any box b and truck t satisfying these conditions—no enumeration needed.


4. Our RMDP Implementation: Zone Abstraction

4.1 Design Philosophy: Grounded RMDP with Parameterized State Abstraction

Our implementation achieves the practical benefits of RMDPs without the complexity of full symbolic logic operations:

Aspect Boutilier SDP Our RMDP Approach
State Representation First-order logic formulas Integer tensor (N, 8)
Value Function Logical case statements 1D numerical tensor
Computation Sequential symbolic ops GPU-parallel numerical
Scalability Infinite domains Fixed at 144 zones
Implementation Complex symbolic stack Standard DP

4.2 Abstract Zone Roles

Instead of encoding concrete zones in the state, we use abstract roles resolved at runtime:

ZoneID Value Type Description
HOME 0 Abstract Agent’s home zone
WORK 1 Abstract Agent’s work zone
SCHOOL 2 Abstract Agent’s school zone
ZONE_1...144 3-146 Concrete Direct mapping to CZONE

Key Insight: The decision structure is identical for all agents—only travel times differ. These are resolved at runtime via agent-specific ZoneBinding.

4.3 ZoneBinding: Runtime Resolution

# Alice's binding
alice_binding = ZoneBinding(
    home=Zone.CZONE_10,
    work=Zone.CZONE_50,
    school=None
)

# Bob's binding
bob_binding = ZoneBinding(
    home=Zone.CZONE_25,
    work=Zone.CZONE_73,
    school=None
)

# Same graph, different concrete zones at runtime
alice_binding.resolve(ZoneID.HOME)  # → Zone.CZONE_10
bob_binding.resolve(ZoneID.HOME)    # → Zone.CZONE_25

4.4 How RMDP Preserves Optimality

  1. Decision Structure: Actions available are identical for all agents
  2. Abstract Roles: Mandatory zones (HOME, WORK, SCHOOL) use abstract roles during planning
  3. Runtime Binding: Actual travel times resolved at simulation time
  4. Shared V-function: All agents use the same value function

5. Implementation Architecture

5.1 New Modules Created

File Purpose
core/zone_abstraction.py ZoneID enum, ZoneBinding class
core/state_rmdp.py StateRMDP, StateEncoderRMDP
planning/forward_pass_rmdp.py RMDP Forward Pass
planning/backward_induction_rmdp.py RMDP Backward Induction
planning/simulate_rmdp.py RMDP Simulation with binding resolution

5.2 Pipeline Timing (Full Day, 1440 minutes)

Step Time Output
RMDP Forward Pass ~8.4s 418,210 states, 21.8M edges
Graph Build (CSR) ~0.2s Sparse format, 6.7 MB
Backward Induction ~0.2s 81.6% valid states
Total ~8.7s Works for ALL (home, work) combos

6. Performance Comparison

Approach Graph Builds BI Runs Total Time (144×144) Speedup
Baseline (No Pruning) 20,736 20,736 71.1h × 20,736 ⚠️
Standard GPU 20,736 20,736 ~316 hours varies
Universal Graph 144 144 × 144 ~14 hours 23×
Universal + Basis Function 144 144 × 8 ~1.5 hours 210×
RMDP Zone Abstraction 1 1 ~50 seconds 29,382×

6.1 Resource Savings

Metric Before RMDP After RMDP Improvement
Graphs needed 20,736 1 20,736×
Build time (worst case) 316 hours 8.7 seconds ~131,000×
Memory 621 GB 30 MB 20,736×

7. Theoretical Connection: Boutilier → Our Implementation

Our Approach RMDP Concept
Abstract zone roles (HOME, WORK) First-order predicates
Zone bindings Instantiating quantified variables
Universal graph Case-statement partitioning of state space
Parameterized travel times Symbolic value terms in case expressions

Key Finding: Our approach achieves the practical benefits of RMDP (~20,000× speedup) without the implementation complexity of full symbolic logic operations.


8. Conclusion

RMDP Zone Abstraction represents the most impactful optimization in the DDCM project:

  1. ~20,000× speedup for population-scale simulations
  2. One universal graph for all (home, work, school) combinations
  3. Runtime binding resolves abstract zones to concrete zones
  4. Exact solutions maintained (same optimality as per-agent graphs)
  5. Simple implementation compared to full Symbolic Dynamic Programming

This enables simulating entire populations (thousands of agents with diverse home/work zones) in seconds rather than weeks.


References

  1. Boutilier, C., Reiter, R., & Price, B. (2001). Symbolic Dynamic Programming for First-Order MDPs. IJCAI.

  2. Powell, W.B. Reinforcement Learning and Stochastic Optimization.

  3. McCarthy et al. (2025). Activity-duration-dependent utility in a dynamic scheduling model.

  4. Västberg et al. (2020). A Dynamic Discrete Choice Activity-Based Travel Demand Model.


Last Updated: January 7, 2026