James Thomas // Dev

From A-Star to Markov Decision Processes: Policy over Perfection

The Definition Problem

When we look at a map to see how to get from point A to point B, we generally draw a line in our mind between the two points. We rarely plan a contingency for every section of the map in case we end up there, but that is precisely what we should do, at least to some extent, to ensure that our robots always have a path, even when the environment throws them curveballs.

A-Star: The Mathematically Perfect Path

A* is the golden child of deterministic planning. It is elegant, mathematically sound, and computationally cheap. It works by combining the known cost from the start node with a heuristic guess to the goal, expressed in the classic function:

\(f(n)=g(n)+h(n)\)

If you have a static grid and an admissible heuristic, A* guarantees an optimal route. It draws the perfect line. But there is a massive catch: A* assumes the world is a frozen snapshot. It assumes that if you command a robot's motors to move 10 centimeters forward, the robot will end up exactly 10 centimeters forward.

In a sterile digital simulation, that is true. In physical reality—where wheel slip, gear backlash, and uneven friction exist—it is a fantasy. A* gives you a rigid sequence of steps. If the environment pushes your robot even slightly off that path, the entire plan is invalidated, forcing a computationally expensive re-plan from scratch. It is perfection at the cost of fragility.
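To make the classic function concrete, here is a minimal A* sketch in Python on a small 4-connected grid. The grid layout, unit step costs, and Manhattan-distance heuristic are assumptions chosen for the example, not part of any particular robot stack:

```python
# Minimal A* on a 4-connected grid. Cells with 1 are walls; every move
# costs 1. The Manhattan-distance heuristic is admissible here, so the
# returned path is optimal -- but note it is a single rigid sequence.
import heapq

def a_star(grid, start, goal):
    rows, cols = len(grid), len(grid[0])

    def h(n):  # heuristic guess to the goal: h(n)
        return abs(n[0] - goal[0]) + abs(n[1] - goal[1])

    open_heap = [(h(start), start)]  # priority = f(n) = g(n) + h(n)
    g = {start: 0}                   # known cost from the start node
    parent = {}

    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node == goal:             # reconstruct the path, start to goal
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1]
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                tentative = g[node] + 1
                if tentative < g.get(nxt, float("inf")):
                    g[nxt] = tentative
                    parent[nxt] = node
                    heapq.heappush(open_heap, (tentative + h(nxt), nxt))
    return None  # no path exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = a_star(grid, (0, 0), (2, 0))  # forced around the wall via (1, 2)
```

The fragility is visible in the return value: `path` is a fixed list of cells. If wheel slip lands the robot anywhere off that list, the plan says nothing about what to do next.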

Markov Decision Process (MDP): The Labor-Intensive Safe Plan

When the physical world pushes back, we have to shift our paradigm to reasoning under uncertainty. This is where the Markov Decision Process (MDP) steps in.

Instead of calculating a single, fragile path, an MDP calculates a Policy (\(\pi\)). A policy is an exhaustive rulebook that tells the robot exactly what the optimal action is, regardless of what state it accidentally wakes up in.

Consider the task of walking across the living room to get a cup of coffee. If it is 2:00 AM and the house is empty, A* works perfectly—you walk in a straight, deterministic line. But if you have a three-year-old son who is home all day, that living room becomes a highly stochastic environment. Toys appear out of nowhere, obstacles move, and your planned trajectory is constantly interrupted.

An MDP solves this by assigning a mathematical value to every single state in the environment, accounting for the probability of failure. This is done using the Bellman equation:

\(V(s)=\max_{a}\left(R(s,a)+\gamma\sum_{s'}P(s'|s,a)V(s')\right)\)

You are no longer asking, "What is the path to the goal?" You are asking the solver, "Given the reward of reaching the goal, and the probability that my actions might fail, what is the best possible move to make from my current square?"

The Trade-off: MDPs are computationally heavy. You must calculate the value of every state in the system before you take a single step. But once that policy is generated, the robot is practically invincible to environmental noise.
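Here is a minimal value-iteration sketch in Python that runs the Bellman update to a fixed point and then extracts the policy. The setup is a hypothetical one-dimensional corridor: five states, a goal at one end, and a 20% chance that any move slips and leaves the robot in place. The slip probability and reward numbers are invented for illustration:

```python
GAMMA = 0.9   # discount factor gamma in the Bellman equation
SLIP = 0.2    # hypothetical probability a move fails (robot stays put)
N, GOAL = 5, 4

def step(s, a):
    """Intended next state for action a, clamped to the corridor."""
    return max(0, min(N - 1, s + (1 if a == "right" else -1)))

def q_value(s, a, V):
    """R(s,a) + gamma * sum over s' of P(s'|s,a) * V(s')."""
    s2 = step(s, a)
    reward = 10.0 if s2 == GOAL else -1.0  # invented reward model
    return reward + GAMMA * ((1 - SLIP) * V[s2] + SLIP * V[s])

# Value iteration: sweep the Bellman update until the values settle.
# This is the computationally heavy part -- every state, every sweep.
V = [0.0] * N
for _ in range(100):
    V = [0.0 if s == GOAL else
         max(q_value(s, a, V) for a in ("left", "right"))
         for s in range(N)]

# The policy pi(s): a rulebook giving the best action from EVERY state,
# so the robot knows what to do wherever it accidentally wakes up.
policy = {s: max(("left", "right"), key=lambda a: q_value(s, a, V))
          for s in range(N) if s != GOAL}
```

Notice the output is not a path. `policy` covers all non-goal states, including ones the robot was never "supposed" to visit, which is exactly what makes it robust to slip.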

The Reality Gap (Synthesis)

We do not have to choose just one. In modern automation, the most resilient architectures stack these concepts:

Algorithm | Role in Stack      | Analogy
A-Star    | The Global Planner | The GPS route telling you to take the highway.
MDP       | The Local Planner  | The driver avoiding unexpected potholes along the way.

Do not build a robot that expects a perfect world. Build a robot with a policy for when the world inevitably goes wrong.