The landing pad is always at coordinates (0, 0), and the coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad with zero speed is about 100–140 points. If the lander moves away from the landing pad, it loses that reward. An episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg–ground contact is worth +10 points, and firing the main engine costs -0.3 points per frame. The environment is considered solved at 200 points. Landing outside the landing pad is possible, and fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions are available: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine.
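For reference, a minimal interaction loop with this environment could look like the sketch below. It assumes the classic Gym API (where `reset()` returns only the observation and `step()` returns four values), and the random policy is purely illustrative, not the agent described here.

```python
import gym

# Assumes Gym with Box2D installed, e.g. `pip install gym[box2d]`
env = gym.make("LunarLander-v2")

obs = env.reset()                # 8-dim state; obs[0], obs[1] are the (x, y) coordinates
print(env.action_space)          # Discrete(4): noop, fire left, fire main, fire right

done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()           # random action, just to show the loop
    obs, reward, done, info = env.step(action)   # reward includes the shaping described above
    total_reward += reward

print("episode return:", total_reward)           # solved means an average return of 200
```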
We have adopted an implementation of Deep Q Network (DQN) from RLlib to train an autonomous Reinforcement Learning agent to land on the surface of the moon. The agent interacts with the environment by performing actions (a) and observing their consequences in the form of rewards (r) and next states (s'). To remember these transitions, the network is equipped with a prioritized experience replay mechanism, which allows it to store up to 1 million transitions. The agent learns the mapping from states to action values by sampling experience tuples (s, a, r, s') from the memory buffer and updating the estimates of its internal value function (Q). It then acts greedily with respect to its value function, trying to maximize the expected sum of all future rewards. Since the agent's value function only approximates the true value of states and actions, simply maximizing the reward may leave the agent stuck in a local optimum. To avoid such situations, we encourage exploration by using a linear schedule for epsilon, annealed from 100% to 2% over 100,000 steps, where epsilon is the probability of taking a random action in a given state.
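A training setup along these lines could be expressed roughly as in the following sketch. It assumes the older RLlib trainer API (around version 1.x); the exact config keys (`buffer_size`, `prioritized_replay`, `exploration_config`) and the iteration count may differ in other releases and are not the exact hyperparameters used here beyond those stated above.

```python
import ray
from ray.rllib.agents.dqn import DQNTrainer   # RLlib ~1.x trainer-style API

ray.init()

config = {
    "env": "LunarLander-v2",
    # Prioritized replay buffer holding up to 1 million (s, a, r, s') transitions
    "buffer_size": 1_000_000,
    "prioritized_replay": True,
    # Linear epsilon schedule: 100% random actions annealed to 2% over 100,000 steps
    "exploration_config": {
        "type": "EpsilonGreedy",
        "initial_epsilon": 1.0,
        "final_epsilon": 0.02,
        "epsilon_timesteps": 100_000,
    },
}

trainer = DQNTrainer(config=config)
for i in range(200):                                 # number of iterations is illustrative
    result = trainer.train()
    print(i, result["episode_reward_mean"])          # environment is "solved" around 200
```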
The full code for the agent is available on GitHub.