SARSA and Q-learning
It is also very useful for an agent to learn the action value function , which informs the agent about the long-term value of taking action in state so that the agent can take those actions that will maximize its expected, discounted future reward. The SARSA and Q-learning algorithms enable an agent to learn that! The following table summarizes the update equation for the SARSA algorithm and the Q-learning algorithm
SARSA is so named because of the sequence State->Action->Reward->State'->Action' that the algorithm's update step depends on. The description of the sequence goes like this: the agent, in state S, takes an action A and gets a reward R, and ends up in the next state S', after which the agent decides to take an action A' in the new state. Based on this experience, the agent can update its estimate of Q(S,A).
Q-learning is a popular off-policy learning algorithm, and it is similar to SARSA, except for one thing. Instead of using the Q value estimate for the new state and the action that the agent took in that new state, it uses the Q value estimate that corresponds to the action that leads to the maximum obtainable Q value from that new state, S'.