In general, machine learning is a form of artificial intelligence that allows computers to improve their performance on a task through data, without being directly programmed. Reinforcement learning is a specialized application of (deep) machine learning in which an agent interacts with its environment and seeks to improve the way it performs a task so as to maximize its reward. The computer employs trial and error. The model designer defines the reward but gives no clues as to how to solve the problem. Reinforcement learning holds potential for trading systems because markets are highly complex and quickly changing dynamic systems. Conventional forecasting models have been notoriously inadequate. A self-adaptive approach that can learn quickly from the outcome of actions may be more suitable. A recent paper proposes a reinforcement learning algorithm for that purpose.

Lau, Thomas and Haoqian Li (2019), “Reinforcement Learning: Prediction, Control and Value Function Approximation”.

Osiński, Błażej and Konrad Budek (2018), “What is reinforcement learning? The complete guide”.

Simonini, Thomas, “An introduction to Reinforcement Learning”.

The text below consists of quotes from the sources listed above. Headings and text in brackets have been added.

The post ties in with the SRSV summary on quantitative methods to increase macro information efficiency.

### What is reinforcement learning?

“Reinforcement learning is an important type of Machine Learning…The idea behind Reinforcement Learning is that __an agent will learn from the environment by interacting with it and receiving rewards for performing actions__…Reinforcement Learning is just a computational approach of learning from action. Reinforcement Learning is based on the idea of the reward hypothesis. All goals can be described by the maximization of the expected cumulative reward. That’s why in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.” [Simonini]

“[In general] __machine learning is a form of artificial intelligence in which computers are given the ability to progressively improve the performance of a specific task with data, without being directly programmed__. **Supervised machine learning** happens when a programmer can provide a label for every training input into the machine learning system. **Unsupervised learning** takes place when the model is provided only with the input data, but no explicit labels. **Deep learning** consists of several layers of neural networks, designed to perform more sophisticated tasks. The construction of deep learning models was inspired by the design of the human brain, but simplified. Deep learning models consist of a few neural network layers which are in principle responsible for gradually learning more abstract features about particular data.

The __key distinguishing factor of reinforcement learning is how the agent is trained__. Instead of inspecting the data provided, the model interacts with the environment, seeking ways to maximize the reward. In the case of deep reinforcement learning, __a neural network is in charge of storing the experiences and thus improves the way the task is performed__.” [Osiński and Budek]

“There should be no clear divide between machine learning, deep learning and reinforcement learning. It is like a parallelogram – rectangle – square relation, where machine learning is the broadest category and the deep reinforcement learning the most narrow one. In the same way, __reinforcement learning is a specialized application of machine and deep learning techniques, designed to solve problems in a particular way__.” [Osiński and Budek]

“Reinforcement learning is the __training of machine learning models to make a sequence of decisions__. The agent learns to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, an artificial intelligence faces a game-like situation. __The computer employs trial and error to come up with a solution to the problem.__ To get the machine to do what the programmer wants, the artificial intelligence gets either rewards or penalties for the actions it performs. Its goal is to maximize the total reward.

Although __the designer sets the reward policy–that is, the rules of the game–he gives the model no hints or suggestions for how to solve the game__. It’s up to the model to figure out how to perform the task to maximize the reward, starting from totally random trials and finishing with sophisticated tactics and superhuman skills. By leveraging the power of search and many trials, reinforcement learning is currently the most effective way to hint machine’s creativity. __In contrast to human beings, artificial intelligence can gather experience from thousands of parallel gameplays__ if a reinforcement learning algorithm is run on a sufficiently powerful computer infrastructure.” [Osiński and Budek]
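
[The trial-and-error idea in the quote above can be illustrated with a very small example. The sketch below is not taken from any of the cited papers. It assumes a hypothetical three-armed bandit with Gaussian rewards and uses an epsilon-greedy rule, one common way to balance random trials against exploiting what has been learned. The function name `run_bandit` and all parameter values are illustrative assumptions.]

```python
import random

def run_bandit(true_means, steps=2000, epsilon=0.1, noise=0.5, seed=0):
    """Epsilon-greedy trial and error on a toy multi-armed bandit."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running estimate of each arm's mean reward
    counts = [0] * n_arms        # number of times each arm has been tried
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore at random
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit best guess
        reward = rng.gauss(true_means[arm], noise)  # feedback from the environment
        counts[arm] += 1
        # Incremental update of the sample mean for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward

# The agent is only told the rewards it receives, never which arm is best.
estimates, counts, total_reward = run_bandit([0.0, 0.5, 1.0])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated mean rewards:", [round(e, 2) for e in estimates])
print("arm believed to be best:", best_arm)
```

[The designer only specifies the reward. Which arm to prefer is discovered from the outcomes of the agent's own actions.]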

“Reinforcement Learning…is a branch of machine learning explicitly designed for taking suitable action to maximize the cumulative reward. It is employed by an agent…to find the best possible behavior or path it should take in a specific situation. __Reinforcement learning differs from the traditional supervised learning [where] the training data has the answer key…provided by an external supervisor__, and the model is trained with the correct answer…In reinforcement learning, the reinforcement agent decides what to do to perform well (quantified by a defined reward function) in the given task.” [Lau and Li]

### How could reinforcement learning be used for trading systems?

“The financial market is one of the most dynamic and fluctuating entities that exist, which makes it difficult to model its behavior accurately. However, the __Reinforcement Learning algorithm, as [a] self-adaptive approach, can conquer these type of difficulties by directly learning from the outcomes of its actions__…Specifically, the investment decision in Reinforcement Learning is a stochastic control problem, or a Markov Decision Process (MDP), where the trading strategies are learned from direct interactions with the market. Thus the need for building forecasting models for future prices or returns is eliminated.” [Lau and Li]
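
[To make the Markov Decision Process framing concrete, here is a minimal sketch of a trading environment with states, actions and rewards. It is an illustrative assumption, not the setup used by Lau and Li: the class `ToyTradingEnv`, its state definition (sign of the last price move plus current position), the three-valued action and the cost parameter are all hypothetical.]

```python
from dataclasses import dataclass

@dataclass
class ToyTradingEnv:
    """Hypothetical single-asset environment; not the setup used in the paper."""
    prices: list          # price path the agent trades through
    cost: float = 0.0     # assumed cost per unit of position change
    t: int = 0            # current time index
    position: int = 0     # -1 short, 0 flat, +1 long

    def _state(self):
        # State = (sign of the latest price change, current position).
        if self.t == 0:
            last_move = 0
        else:
            diff = self.prices[self.t] - self.prices[self.t - 1]
            last_move = (diff > 0) - (diff < 0)
        return (last_move, self.position)

    def reset(self):
        self.t, self.position = 0, 0
        return self._state()

    def step(self, action):
        # Action = target position held over the next period.
        trade_cost = self.cost * abs(action - self.position)
        self.position = action
        pnl = self.position * (self.prices[self.t + 1] - self.prices[self.t])
        self.t += 1
        done = self.t == len(self.prices) - 1
        return self._state(), pnl - trade_cost, done

env = ToyTradingEnv(prices=[100.0, 101.0, 99.0, 102.0])
state = env.reset()
total = 0.0
for action in (1, -1, 1):   # hand-picked actions, for illustration only
    state, reward, done = env.step(action)
    total += reward
print(state, total, done)
```

[A learning agent would replace the hand-picked actions with choices driven by the rewards it has received so far. No price forecast enters the loop.]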

“In quantitative trading, a trader’s objective is to optimize some measure of the performance of the executions, e.g., profit or risk-adjusted return, subjected to some certain constraints…Predictive models of price changes for quantitative trading…are trained based on specific machine learning objective functions (e.g. regression and classification loss functions), and thus there is no guarantee for the models to globally optimize their performances under the measure of the trader’s objective.” [Lau and Li]
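
[The gap between a statistical loss function and the trader's objective can be shown with made-up numbers. In the sketch below, all return series are invented for illustration and the trading rule (go long on a positive forecast, short on a negative one) is an assumption. Forecast A has the lower mean squared error, yet forecast B earns more under that rule.]

```python
# All numbers are invented for illustration.
realized   = [0.001, -0.001, 0.001, -0.001, 0.05]
forecast_a = [-0.001, 0.001, -0.001, 0.001, 0.05]   # tiny errors, wrong sign four times
forecast_b = [0.02, -0.02, 0.02, -0.02, 0.02]       # large errors, always the right sign

def mse(forecast, actual):
    """Mean squared error, a typical regression loss."""
    return sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual)

def sign_strategy_pnl(forecast, actual):
    """Cumulative return of going long (+1) or short (-1) on the forecast sign."""
    return sum(((f > 0) - (f < 0)) * a for f, a in zip(forecast, actual))

mse_a, mse_b = mse(forecast_a, realized), mse(forecast_b, realized)
pnl_a = sign_strategy_pnl(forecast_a, realized)
pnl_b = sign_strategy_pnl(forecast_b, realized)
print(f"forecast A: MSE={mse_a:.2e}, PnL={pnl_a:.3f}")
print(f"forecast B: MSE={mse_b:.2e}, PnL={pnl_b:.3f}")
```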

“With the concrete understanding of the framework of Reinforcement Learning, we will be able to apply Reinforcement Learning to the field of Quantitative Trading…In our paper, __we introduced [a] class of reinforcement learning algorithm that combines both value function approximation and the policy gradient method, namely, the Actor-Critic algorithm__.” [Lau and Li]
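
[The quote names the Actor-Critic algorithm but this post does not reproduce its details. The following is therefore only a generic, tabular one-step actor-critic sketch on a hypothetical two-regime toy problem, not the authors' implementation. The environment (a "down" and an "up" regime that persists with probability 0.9, with a reward of +1 for trading on the right side and -1 otherwise), the learning rates and the function names are all assumptions.]

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train_actor_critic(steps=5000, gamma=0.9, alpha_actor=0.1,
                       alpha_critic=0.05, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 2, 2   # states: 0 = down regime, 1 = up regime
    theta = [[0.0] * n_actions for _ in range(n_states)]  # actor: preferences
    value = [0.0] * n_states                               # critic: state values
    state = 0
    for _ in range(steps):
        probs = softmax(theta[state])
        action = 0 if rng.random() < probs[0] else 1   # 0 = short, 1 = long
        reward = 1.0 if action == state else -1.0      # right side of regime pays
        next_state = state if rng.random() < 0.9 else 1 - state
        # Critic: one-step temporal-difference error and value update.
        td_error = reward + gamma * value[next_state] - value[state]
        value[state] += alpha_critic * td_error
        # Actor: policy-gradient step using the TD error as the signal.
        # For a softmax policy, d log pi(a|s) / d theta[s][b] = 1{a == b} - pi(b|s).
        for b in range(n_actions):
            grad = (1.0 if b == action else 0.0) - probs[b]
            theta[state][b] += alpha_actor * td_error * grad
        state = next_state
    return [softmax(row) for row in theta], value

policy, value = train_actor_critic()
print("P(short | down regime):", round(policy[0][0], 3))
print("P(long  | up regime):  ", round(policy[1][1], 3))
```

[The critic supplies the value function approximation and the actor carries out the policy gradient step, which is the combination the quote refers to.]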

“Reinforcement Learning lies in the interactions between the agent and the environment. The history is a sequence of observations, actions, and rewards…

- The **agent** selects actions and the environment selects observations and rewards.
- The **state** is the information determining what happens next. Formally, an information state is a function of the history containing all the useful information from history, and the process is assumed to possess Markov property.
- A **reward** is a scalar feedback signal indicating how well an agent is doing…__In the trading scenario, we can apply the Sharpe Ratio__…
- The **return** is the total discounted reward…
- It is usually assumed that the system satisfies the **Markov property**, which states…that the probability of transition from the current state to the next depends only on the current state, instead of the whole history…The future is independent of the past given the present…
- There are…two types of **value functions**, namely, the expected return of a state, and the expected return of an action…Model-free prediction learns an unknown Markov Decision Process by estimating its value function…**Model-free control** stands for optimizing the value function of an unknown Markov Decision Process.” [Lau and Li]
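
[Three of the building blocks listed above can be written down in a few lines. The sketch below is illustrative and rests on assumptions: the Sharpe ratio is computed as mean excess return over sample standard deviation without annualization, and Q-learning is used as one well-known example of model-free control, which the quoted text does not name. All function names are hypothetical.]

```python
import statistics

def discounted_return(rewards, gamma):
    """Return G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return over its sample standard deviation (not annualized)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

def q_learning_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One model-free control step on an action-value table q[state][action]."""
    target = reward + gamma * max(q[next_state])   # best value available next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
sr = sharpe_ratio([0.01, 0.03])
q = [[0.0, 0.0], [1.0, 2.0]]   # expected return of each action in each state
updated = q_learning_update(q, state=0, action=1, reward=1.0, next_state=1)
print(g, round(sr, 4), round(updated, 4))
```

[The update needs only an observed transition (state, action, reward, next state), not a model of how the environment generates prices, which is what "model-free" means here.]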