Relational Markov Decision Processes: Theory and Implementation for DDCM
Relational Markov Decision Processes: Theory and Implementation for DDCM
Abstract: This report presents a comprehensive overview of Relational Markov Decision Processes (RMDPs) and their practical application to the Dynamic Discrete Choice Model (DDCM) for activity-based travel demand modeling. We bridge the theoretical foundations from Boutilier et al.’s seminal work on Symbolic Dynamic Programming with a practical GPU-accelerated implementation that achieves ~20,000× speedup for population-scale simulations.
1. Introduction: The Curse of Dimensionality in Standard MDPs
In a standard Markov Decision Process (MDP), you must explicitly enumerate every possible state and action. For travel demand modeling, this creates a fundamental scaling problem:
Example: For a logistics or travel problem with 144 zones, if each agent has a unique (home, work) configuration:
- Unique graphs needed: 144 × 144 = 20,736 graphs
- Graph build time: ~55 seconds each
- Total time: 20,736 × 55s ≈ 316 hours (13 days)
This exponential explosion—the “curse of dimensionality”—makes standard MDPs impractical for real-world applications involving many locations, agents, or time periods.
2. Theoretical Foundations of RMDPs
2.1 From Propositional to First-Order Logic
RMDPs solve the dimensionality problem by borrowing ideas from First-Order Logic (FOL). Instead of a flat list of states, RMDPs describe the world using:
| Component | Description | Example |
|---|---|---|
| Objects & Classes | General categories rather than specific individuals | Box, Truck, City, Zone |
| Relations (Fluents) | Predicates describing state that change over time | On(box, truck), In(truck, city) |
| Action Schemas | Templates that apply to any objects | drive(truck, city_a, city_b) |
2.2 The Power of Quantifiers
First-Order Logic provides two quantifiers that enable abstraction:
- Universal Quantifier (∀): “For all”
- Example: ∀b, ¬On(b, Truck1, s) means “Truck 1 is empty”
- Existential Quantifier (∃): “There exists”
- Example: ∃b, In(b, Paris, s) means “At least one box is in Paris”
Key Insight: A single logical formula can describe policy for any number of objects:
\[\exists b, \text{On}(b, \text{Truck1}) \implies \text{Drive to Paris}\]This applies whether you have 5 boxes or 500 boxes.
3. Symbolic Dynamic Programming (Boutilier et al., 2001)
3.1 Core Problem Addressed
Traditional MDP algorithms become impractical when:
- State spaces grow exponentially with domain features
- Problems involve relational fluents between objects
- Domains contain unbounded or infinite numbers of objects
- Quantification is needed (e.g., “there exists some box in Paris”)
3.2 Key Innovation: Situation Calculus
Boutilier’s approach uses situation calculus as the representational language:
| Element | Notation | Example |
|---|---|---|
| Actions | First-order terms | putTable(b) |
| Situations | Action sequences via do(action, s) |
do(stack(A,B), S₀) |
| Fluents | Relations varying by situation | BIn(b, Paris, s) |
3.3 Decomposing Stochastic Actions
A critical trick: decompose stochastic actions into deterministic primitives with probabilistic outcomes:
Stochastic load(b,t) becomes:
- loadS(b,t) — successful load (probability 0.9)
- loadF(b,t) — failed load (probability 0.1)
3.4 Case Notation for Value Functions
Value functions are represented without state enumeration using case notation:
V(s) = case {
∃b. BIn(b, Rome, s) → 10, // If any box is in Rome, value = 10
¬∃b. BIn(b, Rome, s) → 0 // Otherwise, value = 0
}
This applies regardless of how many boxes exist—a universal description.
3.5 First-Order Decision-Theoretic Regression (FODTR)
The core algorithm computes Q-functions symbolically:
\[Q^V(A(x), s) = R(s) + \gamma \sum_j \text{prob}(n_j(x) | A(x), s) \times V(do(n_j(x), s))\]Key steps:
- Substitute case statements for R(s), prob(·), and V(·)
- Apply regression to move from future situation back to current
- Combine results using case operators
3.6 Dual Abstraction: States AND Actions
A crucial advantage is action abstraction. A single condition like:
On(b,t,s) ∧ TIn(t, Paris, s) ∧ ¬∃b'. BIn(b', Paris, s)
describes when to execute unload(b,t) for any box b and truck t satisfying these conditions—no enumeration needed.
4. Our RMDP Implementation: Zone Abstraction
4.1 Design Philosophy: Grounded RMDP with Parameterized State Abstraction
Our implementation achieves the practical benefits of RMDPs without the complexity of full symbolic logic operations:
| Aspect | Boutilier SDP | Our RMDP Approach |
|---|---|---|
| State Representation | First-order logic formulas | Integer tensor (N, 8) |
| Value Function | Logical case statements | 1D numerical tensor |
| Computation | Sequential symbolic ops | GPU-parallel numerical |
| Scalability | Infinite domains | Fixed at 144 zones |
| Implementation | Complex symbolic stack | Standard DP |
4.2 Abstract Zone Roles
Instead of encoding concrete zones in the state, we use abstract roles resolved at runtime:
| ZoneID | Value | Type | Description |
|---|---|---|---|
HOME |
0 | Abstract | Agent’s home zone |
WORK |
1 | Abstract | Agent’s work zone |
SCHOOL |
2 | Abstract | Agent’s school zone |
ZONE_1...144 |
3-146 | Concrete | Direct mapping to CZONE |
Key Insight: The decision structure is identical for all agents—only travel times differ. These are resolved at runtime via agent-specific ZoneBinding.
4.3 ZoneBinding: Runtime Resolution
# Alice's binding
alice_binding = ZoneBinding(
home=Zone.CZONE_10,
work=Zone.CZONE_50,
school=None
)
# Bob's binding
bob_binding = ZoneBinding(
home=Zone.CZONE_25,
work=Zone.CZONE_73,
school=None
)
# Same graph, different concrete zones at runtime
alice_binding.resolve(ZoneID.HOME) # → Zone.CZONE_10
bob_binding.resolve(ZoneID.HOME) # → Zone.CZONE_25
4.4 How RMDP Preserves Optimality
- Decision Structure: Actions available are identical for all agents
- Abstract Roles: Mandatory zones (HOME, WORK, SCHOOL) use abstract roles during planning
- Runtime Binding: Actual travel times resolved at simulation time
- Shared V-function: All agents use the same value function
5. Implementation Architecture
5.1 New Modules Created
| File | Purpose |
|---|---|
core/zone_abstraction.py |
ZoneID enum, ZoneBinding class |
core/state_rmdp.py |
StateRMDP, StateEncoderRMDP |
planning/forward_pass_rmdp.py |
RMDP Forward Pass |
planning/backward_induction_rmdp.py |
RMDP Backward Induction |
planning/simulate_rmdp.py |
RMDP Simulation with binding resolution |
5.2 Pipeline Timing (Full Day, 1440 minutes)
| Step | Time | Output |
|---|---|---|
| RMDP Forward Pass | ~8.4s | 418,210 states, 21.8M edges |
| Graph Build (CSR) | ~0.2s | Sparse format, 6.7 MB |
| Backward Induction | ~0.2s | 81.6% valid states |
| Total | ~8.7s | Works for ALL (home, work) combos |
6. Performance Comparison
| Approach | Graph Builds | BI Runs | Total Time (144×144) | Speedup |
|---|---|---|---|---|
| Baseline (No Pruning) | 20,736 | 20,736 | 71.1h × 20,736 ⚠️ | 1× |
| Standard GPU | 20,736 | 20,736 | ~316 hours | varies |
| Universal Graph | 144 | 144 × 144 | ~14 hours | 23× |
| Universal + Basis Function | 144 | 144 × 8 | ~1.5 hours | 210× |
| RMDP Zone Abstraction | 1 | 1 | ~50 seconds | 29,382× |
6.1 Resource Savings
| Metric | Before RMDP | After RMDP | Improvement |
|---|---|---|---|
| Graphs needed | 20,736 | 1 | 20,736× |
| Build time (worst case) | 316 hours | 8.7 seconds | ~131,000× |
| Memory | 621 GB | 30 MB | 20,736× |
7. Theoretical Connection: Boutilier → Our Implementation
| Our Approach | RMDP Concept |
|---|---|
| Abstract zone roles (HOME, WORK) | First-order predicates |
| Zone bindings | Instantiating quantified variables |
| Universal graph | Case-statement partitioning of state space |
| Parameterized travel times | Symbolic value terms in case expressions |
Key Finding: Our approach achieves the practical benefits of RMDP (~20,000× speedup) without the implementation complexity of full symbolic logic operations.
8. Conclusion
RMDP Zone Abstraction represents the most impactful optimization in the DDCM project:
- ~20,000× speedup for population-scale simulations
- One universal graph for all (home, work, school) combinations
- Runtime binding resolves abstract zones to concrete zones
- Exact solutions maintained (same optimality as per-agent graphs)
- Simple implementation compared to full Symbolic Dynamic Programming
This enables simulating entire populations (thousands of agents with diverse home/work zones) in seconds rather than weeks.
References
-
Boutilier, C., Reiter, R., & Price, B. (2001). Symbolic Dynamic Programming for First-Order MDPs. IJCAI.
-
Powell, W.B. Reinforcement Learning and Stochastic Optimization.
-
McCarthy et al. (2025). Activity-duration-dependent utility in a dynamic scheduling model.
-
Västberg et al. (2020). A Dynamic Discrete Choice Activity-Based Travel Demand Model.
Last Updated: January 7, 2026