Watch an Agent Learn
At first the agent flails and the pole topples within a second. Leave it running: with nothing but a reward of +1 per step and trial and error, the agent discovers how to balance, and the learning curve climbs past the “solved” line. Drag the speed control to fast-forward training, then choose “watch greedy” to see its best learned policy with exploration turned off.
Overview
The agent has only two actions: push the cart left or push it right. Balancing the pole is a control problem with no obvious rulebook, so the agent has to discover a policy purely from the consequences of its actions.
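To make the control problem concrete, here is a minimal sketch of one simulation step of a cart-pole system. The constants (masses, pole length, push force, time step, failure limits) are conventional textbook values assumed for illustration, not values taken from this demo, and the names `step` and `rollout_constant` are hypothetical.

```python
import math

# Assumed, conventional cart-pole constants (not read from the demo).
GRAVITY = 9.8                   # m/s^2
MASS_CART = 1.0                 # kg
MASS_POLE = 0.1                 # kg
HALF_LENGTH = 0.5               # m, half the pole length
FORCE_MAG = 10.0                # N, size of each push
DT = 0.02                       # s, integration time step
X_LIMIT = 2.4                   # m, track half-width
THETA_LIMIT = math.radians(12)  # rad, angle at which the pole counts as fallen


def step(state, action):
    """Advance one time step. action: 0 = push left, 1 = push right.

    Returns (next_state, reward, done).
    """
    x, x_dot, theta, theta_dot = state
    force = FORCE_MAG if action == 1 else -FORCE_MAG
    total_mass = MASS_CART + MASS_POLE
    pole_mass_length = MASS_POLE * HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    # Standard frictionless cart-pole equations of motion.
    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass

    # Explicit Euler integration over one time step.
    next_state = (
        x + DT * x_dot,
        x_dot + DT * x_acc,
        theta + DT * theta_dot,
        theta_dot + DT * theta_acc,
    )
    done = abs(next_state[0]) > X_LIMIT or abs(next_state[2]) > THETA_LIMIT
    return next_state, 1.0, done  # +1 reward for every step taken


def rollout_constant(action, max_steps=500):
    """Count the steps survived when the same action is repeated every step."""
    state, steps, done = (0.0, 0.0, 0.0, 0.0), 0, False
    while not done and steps < max_steps:
        state, _, done = step(state, action)
        steps += 1
    return steps
```

Calling `rollout_constant(1)` pushes right on every step, which tips the pole over quickly. That is roughly the situation an untrained agent starts from: no single fixed action keeps the pole up, so the choice has to depend on the state.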
Methodology
Tabular Q-learning over a discretized state space: Q(s,a) ← Q(s,a) + α[r + γ·maxₐ′ Q(s′,a′) − Q(s,a)], where α is the learning rate and γ the discount factor, with ε-greedy exploration whose ε decays as the agent gains experience. The physics is the classic cart-pole system (Barto, Sutton, and Anderson, 1983).
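The update rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the demo's actual code: the bin edges, the values of α and γ, the multiplicative ε schedule, and all function names are hypothetical. Skipping the bootstrap term on a terminal step is a common convention that is also assumed here.

```python
import bisect
import math
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate alpha (assumed value)
GAMMA = 0.99   # discount factor gamma (assumed value)
N_ACTIONS = 2  # 0 = push left, 1 = push right

# Hypothetical bin edges for (x, x_dot, theta, theta_dot).
BIN_EDGES = (
    (-0.8, 0.8),
    (-0.5, 0.5),
    (-math.radians(6), -math.radians(1), 0.0, math.radians(1), math.radians(6)),
    (-math.radians(50), math.radians(50)),
)


def discretize(state):
    """Map a continuous state to a tuple of bin indices (the table key)."""
    return tuple(bisect.bisect_right(edges, value)
                 for edges, value in zip(BIN_EDGES, state))


def new_q_table():
    """Q-table in which every unseen state starts at zero for each action."""
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def choose_action(Q, s, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = Q[s]
    return values.index(max(values))  # ties go to the lower action index


def q_update(Q, s, a, r, s_next, done):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    bootstrap = 0.0 if done else GAMMA * max(Q[s_next])
    Q[s][a] += ALPHA * (r + bootstrap - Q[s][a])


def decay_epsilon(epsilon, rate=0.995, floor=0.01):
    """Shrink epsilon multiplicatively, never below the floor (assumed schedule)."""
    return max(floor, epsilon * rate)
```

In a training loop, each step would call `discretize` on the new state, `choose_action` to pick a push, and `q_update` with the observed reward. Passing `epsilon=0.0` to `choose_action` gives the purely greedy policy, which corresponds to watching the agent with exploration turned off.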
Applications
Robotics and locomotion, process and traffic control, game-playing agents, recommendation systems, and models of the reward-driven learning that dopamine signalling in the brain is thought to support.