June 21, 2021

An Introduction to Reinforcement Learning

An Introduction to Reinforcement Learning

Emulating the natural learning process has been an inspiration for developing many machine learning algorithms. Reinforcement learning is one of them.
We have all encountered many occasions as kids where we learnt a behaviour through positive or negative reinforcement. Think of the first time you touched a thorny rose stem and figured that next time you should probably grab it more gently. This was a learning experience through negative reinforcement. When your parents praised you or got you a gift when you scored an A on that math test, that was a positive reinforcement learning instance where you learnt that studying hard and getting good grades is good and something to be repeated, or at least your parents thought that.
But these learning encounters do stop at childhood. For instance, you are more likely to remember (learn) to buckle up if the car doesn't stop beeping until you secure your seatbelt (negative reinforcement). You have likely implemented some form of reinforcement to train your dog, and the list goes on...
The common factor in all these learning examples is the presence of some reinforcer, positive or negative, that increases a desired behaviour, e.g. avoiding thorns, getting As, closing your seatbelt, etc. These reinforcers can be a natural consequence of our behaviour as is the case when we feel pain by touching sharp thorns, or they could be an external intervention as was the case when your parents got you a gift for being a hard working student.

Seeing the effectiveness of this learning paradigm, the machine learning community tried simulating it. We call this subset of machine learning reinforcement learning (RL). There are various types of RL algorithms and they are applied to different domains. The most common one is in gaming, a famous example is DeepMind's AlphaGo that beat the world Go champion. You can find a full documentary of this project here.
Other use cases of RL are in the finance sector for optimising clients' portfolios or in recommendation engines where the goal is to recommend the products that the user will most likely purchase.
But the question is how are these RL algorithms designed to replicate the natural learning process we experience in order to learn Go or find the best recommendation?
In this article, I will focus on the basic terminology and different types of RL algorithms.

Definition and terminology

I briefly touched upon what RL is in a broader scope, but what is its definition and objective in the machine learning world?
RL is a subset of machine learning which involves creating software that can learn a strategy for behaving in a desired way. In order to understand this better let's first define some terminology:

Environment (E) and Agent (A)

The agent is the entity that we want to learn desired behaviours. This could be a robot we are trying to get to learn playing football or the virtual player in a chess game. The environment is the system that the agent interacts with in its learning process. In the robot example, it's the football field, and in the chess game it's the chess board.

State (S)

The environment is made up of smaller units called states that contain information about the environment. For example, each position on the chess board is a state that has positions to its top, bottom, left, right and at a particular time may have the opponent's bishop to its right. This information helps the agent in deciding how to interact with the environment.

Action (a)

Actions are the execution of the agent's decisions. It's the behaviour it performs in interacting with the environment.

Reward (R)

Reward is the core of RL. It's the reinforcer of the desired behaviour we are hoping the agent to learn. It is a scalar that is fed back by the environment based on the action the agent takes. In other words, the reward signal is an immediate measurement of the agent's progress.

Policy ($\mathbf{\pi}$)

Policy is the strategy that the agent employs for deciding its next action.

Value function, V(s)

Value function is a function of states, $\mathbf{s}$, and is a measure of the goodness of any given state. We consider goodness to be equivalent to the expected future rewards. In other words, if being in state A opens the path to states with high rewards that the agent can take next (high expected future rewards), then state A has high value.
Formally, the value function is represented as below:

$\begin{aligned}V^\pi({s_t}) = E_{\pi}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + ... | s_t=s] = E_{\pi}[\sum_{k=0}^{\infty}\gamma^kr_{t+k} | s_t=s]\end{aligned}\tag{1}$
This equation is computing the accumulation of all the future rewards that are expected given that the agent is in state $s$ at time $t$ and takes actions under policy $\pi$. The term $\gamma$ that acts as a weight for future rewards is called the *discount factor* and usually has a positive value less than 1. This discount factor reduces the future rewards because they are delayed as opposed to the immediate reward received at time $t$.

Q function, Q(s, a)

Q function is similar to the Value function with the difference being that it measures the goodness of a given state, action pair. That is, V(s) computes the expected future reward on all possible actions, whereas Q(s,a) computes the expected future reward of each individual action. We can formulate this goodness measure as Eq.2.

$\begin{aligned}Q^\pi({s_t, a_t}) = E_{\pi}[\sum_{k=0}^{\infty}\gamma^kr_{t+k} | s_t=s, a_t=a]\end{aligned}\tag{2}$


Model is a description of the environment in terms of a distribution over the rewards and states. In other words, if we know what is the probability of transitioning from state to state $s'$ and receiving reward $r$ using action $a$ from state $s$, we have a model of the environment.

The Reinforcement Learning Objective

Maximising the accumulated reward is the main objective of reinforcement learning problems. This can be done through maximising the Q function for example or it can be done through learning the policy directly. Either approach has various algorithms.

Different Types of RL algorithms

RL algorithms fit into different categories:

Model-based vs. model-free algorithms:

Algorithms that use a function of state transitions in the environment are model-based methods, whereas algorithms that do not consider any information about the environment are model-free.
Model-based approaches have the advantage of capturing all the information there is about the problem, all the dynamics of the environment, etc. You can think of it as simulating a blueprint of the environment where you can easily give it an unseen frame of a video game and ask what the next frame will be. However, this approach has some downsides. One is that it requires high computation and is likely to consider details that are irrelevant to the game itself. To clarify, consider using a model-based RL algorithm for learning how to play an Atari game. This model-based approach will first learn the essence of the game, how certain actions transition the current frame to the next frame. Assuming there is some random video in the background, there will be a lot of action-independent information in the that the algorithm considers relevant and therefore consumes a lot of compute and capacity to learn these dynamics which have no effect on the actual game.
Another disadvantage is that once the environment's dynamics are learnt, going from this model to the actual policy for determining the agent's actions is non-trivial and can also be computationally expensive.

Value-based vs. Policy-based algorithms:

In value-based methods, the model optimises the value function and implies the policy from the optimal value function, i.e. picks the action that gets to the state with the highest value. This approach is closer to the true objective of maximising the cumulative reward, but still has the issue of focusing on less important details. For example, consider a grid where a robot is in and wants to learn the optimal policy. The policy in this setting can happen to be as simple as "always go up" which is fairly easy to learn, however the value optimisation process which this simple policy will be derived from may not be as straightforward. This can be due to various observations that the robot has in the grid that may slow it down temporarily, e.g. some object pops up and gets in the way. These observations change the value function and we will need a rich value function approximator such as a deep neural network to approximate it, whereas the policy itself is completely blind to the observations and is very simple. So in this example, learning the policy directly will be easier and more efficient. This is what policy-based algorithms focus on, learning the policy explicitly through optimising a policy objective function which is the true objective of RL. Using all the data to only update the policy however may not always be beneficial as some samples contain little information about the policy and a lot about the environment, and discarding this valuable knowledge is bad.
There are approaches that combine the benefits of value-based and policy-based learning, such as actor-critic learning.

Off-policy vs. on-policy algorithms:

Some RL algorithms such as Q-learing, update the estimated return at each iteration irrespective of the policy at hand. Whereas on-policy algorithms such as SARSA, estimates the return for state-action pairs assuming the current policy continues to be followed. To clarify, in off-policy learning, the agent interacts with the environment under a policy by taking actions. These interactions are saved into a "buffer" along with all the other interactions made thus far. At each time-step the policy is updated based on samples from all the past interactions in this buffer and not solely from the most recent policy. However, during on-policy, there is no buffer and the policy gets updated based on the agent's most recent interaction with the environment.


In this article, I gave a brief introduction into what RL is and what are the categories that different RL algorithms fit into.
I hope you enjoyed learning about the basics of reinforcement learning. If you like reading about machine learning, natural language processing and their applications, follow me on twitter for more content!