Agent performing random actions (left and right)
Trained agent balancing the pole
The CartPole problem is a classic control problem and one of the simplest RL problems, which can easily be solved using a DQN. The environment is considered "solved" if the agent balances the pole for an average of 195 timesteps (equivalently, an average score of 195.0) over 100 consecutive episodes. The agent shown above was trained for 250 episodes and is balancing the pole in a single trial.
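The snippet below is a minimal sketch of the kind of DQN loop that solves CartPole, including the 100-episode "solved" check described above. It assumes the classic `gym` API and PyTorch; the env name `CartPole-v0`, network sizes, and hyperparameters are illustrative choices, not the exact setup used in this repo.

```python
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    """Maps a 4-dimensional CartPole state to Q-values for the 2 actions."""
    def __init__(self, state_size=4, action_size=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v0")
q_net = QNetwork()
optimizer = optim.Adam(q_net.parameters(), lr=1e-3)
memory = deque(maxlen=10000)          # replay buffer
gamma, eps, batch_size = 0.99, 0.1, 64
scores = deque(maxlen=100)            # rolling window for the "solved" check

for episode in range(500):
    state, score, done = env.reset(), 0.0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = q_net(torch.FloatTensor(state)).argmax().item()
        next_state, reward, done, _ = env.step(action)
        memory.append((state, action, reward, next_state, done))
        state, score = next_state, score + reward

        if len(memory) >= batch_size:
            s, a, r, s2, d = map(np.array, zip(*random.sample(memory, batch_size)))
            s, s2 = torch.FloatTensor(s), torch.FloatTensor(s2)
            a = torch.LongTensor(a).unsqueeze(1)
            r, d = torch.FloatTensor(r), torch.FloatTensor(d)
            # standard DQN target: r + gamma * max_a' Q(s', a')
            target = r + gamma * q_net(s2).max(1)[0].detach() * (1 - d)
            pred = q_net(s).gather(1, a).squeeze(1)
            loss = nn.functional.mse_loss(pred, target)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    scores.append(score)
    # CartPole-v0 counts as solved at an average score of 195.0 over 100 episodes
    if len(scores) == 100 and np.mean(scores) >= 195.0:
        print(f"Solved in {episode + 1} episodes")
        break
```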
Agent performing random actions (left and right)
Trained agent building momentum to reach the flag
The Mountain Car problem is a classic control problem and one of the trickier RL problems, because the agent receives a meaningful reward signal only when it reaches the goal. Since the car's engine isn't strong enough to drive straight up the slope, the car has to build "momentum" by swinging back and forth to reach the top of the mountain. This is an example of the sparse-rewards problem in RL. It is solved in about 5000 episodes using a Double DQN (DDQN), in which a local (online) network is trained against targets computed by a periodically updated, otherwise fixed target network. One of the agent's test runs is shown above.
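To make the local/target split concrete, here is a hedged sketch of the Double DQN target computation in PyTorch. The function and variable names are illustrative assumptions (not the repo's exact code); the key idea is that the local network selects the next action while the target network evaluates it.

```python
import torch
import torch.nn.functional as F

def ddqn_update(local_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the local network toward the Double DQN target."""
    states, actions, rewards, next_states, dones = batch  # pre-built tensors

    # Double DQN: the *local* network selects the next action...
    next_actions = local_net(next_states).argmax(dim=1, keepdim=True)
    # ...but the *target* network evaluates it, reducing overestimation bias.
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1).detach()
    targets = rewards + gamma * next_q * (1 - dones)

    q_pred = local_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target(local_net, target_net):
    """Periodically copy the local network's weights into the fixed target network."""
    target_net.load_state_dict(local_net.state_dict())
```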
Agent (right) performing random actions and losing
Trained agent (right) beating hard-coded AI (left) with superhuman performance
Pong is a simple two-player game where the goal is to ensure that your opponent cannot return the ball in a volley. Using a small replay buffer of 10,000 transitions and training for about 700-800 episodes (roughly 1M timesteps), the DDQN agent converges to an average score of +18, while the average human score is -3 (meaning the hard-coded AI typically wins with a 3-point lead). Training for more timesteps (say, 10M) with a bigger replay buffer (say, 100K) is expected to achieve a perfect score of +21 (the agent wins every volley and never loses).
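For reference, a minimal experience-replay buffer of the kind referred to above might look like the sketch below. The capacity follows the 10,000 quoted in the text; the class and method names are illustrative assumptions rather than the repo's actual implementation.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```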