Machine learning can improve macro trading strategies, mainly because it makes them more flexible and adaptable, and because it generalizes knowledge better than fixed rules or trial-and-error approaches. Within the constraints of pre-set hyperparameters, machine learning continuously and autonomously learns from new data, thereby challenging or refining prevalent beliefs. Machine learning and expert domain knowledge are not rivals but complements. Domain expertise is critical for the quality of featurization, the choice of hyperparameters, the selection of training and test samples, and the choice of regularization methods. Modern macro strategists may not need to make predictions themselves but can add great value by helping machine learning algorithms find the best prediction functions.

The notes below are based on a review of “Foundations of Machine Learning,” an online training course by David S. Rosenberg, viewed solely from the angle of macro trading strategies.

The post ties in with this site’s summary of quantitative methods for macro efficiency.

### Key benefits of machine learning for macro trading strategies

Most systematic macro trading strategies are based on fixed rules. Fixed trading rules are often maintained until they evidently break. By contrast, __decision making with machine learning is based on variable rules__. Since financial market environments are prone to structural change and instability, this is a critical advantage.

Conventional trading rules are based on trial-and-error. Generating them is time- and labour-intensive. Moreover, fixed rules do not deal well with uncertainty and unanticipated input, such as unprecedented volatility shocks or negative interest rates. By contrast, __machine learning systems generalize knowledge better and are more easily adjustable than conventional rules, as long as they are provided with sufficient data__. New experiences automatically become new training data that condition future actions.

Supervised machine learning algorithms propose actions based on a set of training data and a restricted hypothesis space. Training data are pairs of input and output data. The hypothesis space describes the type of prediction functions that the algorithm may consider. Given these restrictions, the algorithm is allowed to learn from the data on its own, as opposed to just reverse-engineering expert rules or verifying prior beliefs. This way __machine learning challenges or at least refines conventional wisdom by design__. Learning becomes smooth and continuous. This lessens obstructions to learning that arise from rigid institutional constraints or personal attachment to specific beliefs.

### The role of expert knowledge

__The rise of machine learning does not devalue expert knowledge in economics and finance.__ Supervised learning methods require qualified prior beliefs. For example, the data scientist must choose plausible data sets and hyperparameters that control model complexity and model type. These choices require ample experience and domain knowledge.

__Inputs into machine learning algorithms can be of a large variety of types, including text and images__. Yet all must be translated into a fixed-dimensional vector space to be fed into a prediction function. The mapping from raw information (without structure) to a fixed-dimensional vector space is called **featurization** or feature extraction. This is a very important step that requires domain expertise. __The more problems feature extraction solves, the fewer difficulties the machine learning algorithm has to deal with__. For financial market practice, this means that the better we are able to structure our input data from the myriad of available information, the easier the application of machine learning becomes. Hence, __knowledge of markets and economics remains important for competitive advantage__.

### How machine learning supports decision making

Decision theory is about choosing the best actions, under various definitions of optimality. __Action is also the generic term for the output of a machine learning system, based on a pre-defined action space__. The decision function (which is equivalent to a prediction function) takes an input and prescribes an action. __This decision function is the key product of machine learning__. Actions are evaluated with respect to their outcome, typically by use of a loss function.

__For formalization, decision theory refers to three spaces: input space, action space and output space__. In the case of macro trading, the input space could contain relevant market and economic information (typically multiple real-valued time series). The action space could be a proposed trade and the output space could be the return on this trade. The spaces depend on the type of machine learning algorithm that was chosen.

Many problem domains can be formalized as four steps: [1] observe an input, [2] take an action, [3] observe the outcome, and [4] evaluate the action in relation to the outcome. __The evaluation of actions is the subject of standard learning theory__. This theory is based on the idea that we want to find a decision function that does well on average, i.e. that produces a small loss through its actions. __The expected loss of a decision function is called its risk__. Typically, this needs to be estimated based on available data and assumptions about their properties.

__Empirical risk is the average loss over the available input/output data__.
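In symbols, the two concepts can be sketched as follows. The notation is an assumption of this sketch and not taken from the notes above: $f$ is a decision function, $\ell$ a loss function, $(x, y)$ an input/outcome pair, and $(x_1, y_1), \dots, (x_n, y_n)$ the available sample.

```latex
R(f) = \mathbb{E}\big[\ell(f(x), y)\big],
\qquad
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
```

Here $R(f)$ is the risk, which depends on the unknown data distribution, and $\hat{R}_n(f)$ is the empirical risk, which can be computed from the sample.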

A **Bayes decision function** is a function that __achieves minimal risk among all possible functions__. Its risk is called Bayes risk. However, the __in-sample optimal decision function, based simply on empirical loss, may be indeterminate and may not be the best out-of-sample__. This is where machine learning algorithms come in.

The key qualification of machine learning methods is **generalization**. __Generalization means spreading information we already have to other training points or other parts of the input space__ that we have not seen. This requires some “smoothness” in the prediction function, i.e. similar inputs should have similar outputs. Machine learning seeks to constrain prediction functions so that such smoothness is achieved. One approach is called **constrained empirical risk minimization**. __Instead of minimizing empirical risk over all possible decision functions it constrains those functions to a particular subset__, called a hypothesis space. The best function within that constrained space is called “risk minimizer”.
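The difference between unconstrained and constrained empirical risk minimization can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the data are made-up numbers, the hypothesis space is the set of functions `f(x) = w * x`, and the "memorizer" stands in for an unconstrained function that fits the training sample perfectly but says nothing sensible about unseen inputs.

```python
def squared_loss(prediction, outcome):
    return (prediction - outcome) ** 2


def empirical_risk(f, data):
    """Average loss of decision function f over (input, outcome) pairs."""
    return sum(squared_loss(f(x), y) for x, y in data) / len(data)


def fit_memorizer(train):
    """Unconstrained choice: reproduce every training outcome exactly."""
    table = dict(train)
    # Unseen inputs get an arbitrary answer (0.0); nothing was learned about them.
    return lambda x: table.get(x, 0.0)


def fit_linear(train):
    """Constrained choice: least squares within the space f(x) = w * x."""
    w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: w * x


# Hypothetical (input, outcome) pairs, for illustration only.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
test = [(1.5, 3.0), (2.5, 5.0)]

memorizer = fit_memorizer(train)
linear = fit_linear(train)

print("memorizer:", empirical_risk(memorizer, train), empirical_risk(memorizer, test))
print("linear:   ", empirical_risk(linear, train), empirical_risk(linear, test))
```

The memorizer has zero empirical risk in sample, yet the constrained linear function does better on the unseen points, because the constraint forces similar inputs to receive similar predictions.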

### The train-test principle

Machine learning translates training data into **prediction functions** or **decision functions**. These functions deliver predictions or prescribe actions, called **labels**, for a case based on available features, represented by a **feature vector**.

The evaluation of a prediction function is typically based on loss, a metric for the gravity of errors. It is calculated based on a specific **loss function** (such as squared errors or absolute errors) and based on a **test set** of data that is independent of the training set based on which the prediction function was chosen. __It is important that the test set does not contaminate the training__. This means that its information must not influence the choices with respect to the machine learning algorithm or the prediction function. Unfortunately, this is a significant risk with financial time series, because researchers typically know features of the history on which prediction functions are tested. If information about labels sneaks into the features in a way that would never happen in deployment, this is called **leakage**.
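One simple guard against leakage in time series is to pair each label only with feature values that were already known before the label period. The sketch below assumes two aligned lists of equal length (a hypothetical indicator and subsequent returns); the function name `lagged_pairs` is made up for this example.

```python
def lagged_pairs(features, returns, lag=1):
    """Pair the feature observed at time t - lag with the return at time t.

    With lag >= 1 the feature is strictly older than the label, so it could
    have been known at the time the trade was taken.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1 to avoid using same-period information")
    if len(features) != len(returns):
        raise ValueError("features and returns must be aligned and of equal length")
    return [(features[t - lag], returns[t]) for t in range(lag, len(returns))]


# Hypothetical values, for illustration only.
indicator = [0.1, 0.2, 0.3, 0.4]
period_returns = [1.0, 2.0, 3.0, 4.0]

pairs = lagged_pairs(indicator, period_returns, lag=1)
print(pairs)
```

This only addresses timing. It does not protect against subtler forms of leakage, such as revised economic data or choices made with knowledge of the test history.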

The train-and-test principle of machine learning is a simulation of the traditional train-and-deploy principle prevalent in the investment industry. However, it is much more efficient and cheaper.

__Train-test evaluation is another area where domain knowledge of experts is essential.__ Financial market data sets are prone to non-stationarity, which here refers to change in the data distribution, typically due to **covariate shift** (the input distribution changes between training and test) or **concept drift** (the correct output for a given input changes over time). This can be due to policy changes (e.g. inflation targeting, quantitative easing), market structure changes (e.g. exchange rate regimes shifting from fixed to flexible) or technological changes (e.g. enhanced short-term information efficiency). __It is inappropriate to train and test over influential structural changes__. This would lead to what is called **sample bias**.

Test sets must be large enough to be meaningful. This can be an issue for macro trading strategies as there is only limited history of financial crises or business cycles. K-fold cross-validation is a standard train-test evaluation, particularly for smaller samples. This method selects k prediction functions and performances based on k different (albeit generally overlapping) training sets and k independent (non-overlapping) test sets. __Cross-validation is not concerned with the performance of an individual prediction function, but with the performance of the model building algorithm__. Each algorithm would produce a mean and standard deviation of loss measures. Of course, the actual prediction function used for deployment would be based on all the data.
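The mechanics of k-fold cross-validation can be sketched in a few lines. This is a minimal illustration under stated assumptions: folds are contiguous blocks without shuffling, the "algorithm" being evaluated is a deliberately trivial constant predictor (`fit_constant`), and the data are made-up numbers. All function names are hypothetical.

```python
from statistics import mean, stdev


def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) with k non-overlapping test folds covering 0..n-1."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def fit_constant(train):
    """Toy learning algorithm: always predict the mean training label."""
    level = mean(y for _, y in train)
    return lambda x: level


def mean_squared_error(f, data):
    return mean((f(x) - y) ** 2 for x, y in data)


def cross_validate(fit, risk, data, k):
    """Return mean and standard deviation of test losses across the k folds."""
    fold_losses = []
    for train_idx, test_idx in k_fold_splits(len(data), k):
        f = fit([data[i] for i in train_idx])
        fold_losses.append(risk(f, [data[i] for i in test_idx]))
    return mean(fold_losses), stdev(fold_losses)


# Hypothetical (feature, label) pairs, for illustration only.
data = [(0.1, 1.0), (0.2, 1.2), (0.3, 0.8), (0.4, 1.1), (0.5, 0.9), (0.6, 1.0)]
cv_mean, cv_std = cross_validate(fit_constant, mean_squared_error, data, k=3)
print(cv_mean, cv_std)
```

The returned mean and standard deviation describe the algorithm (`fit_constant`), not any single fitted function, which is the point made above.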

For time series, cross-validation is typically done through **forward chaining** based on expanding training time series. This allows checking if a specific machine learning algorithm consistently produces good prediction functions across time.
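Forward chaining can be sketched as an index generator in which every test block lies strictly after its training window. The parameters `min_train` and `test_size` are assumptions of this sketch, not part of the source notes.

```python
def forward_chaining_splits(n, min_train, test_size):
    """Yield (train_idx, test_idx) with an expanding training window.

    Each test block starts where the training window ends, so the model is
    always evaluated on observations later than anything it was fitted on.
    """
    start = min_train
    while start + test_size <= n:
        yield list(range(0, start)), list(range(start, start + test_size))
        start += test_size


for train_idx, test_idx in forward_chaining_splits(n=10, min_train=4, test_size=2):
    print("train up to", train_idx[-1], "-> test", test_idx)
```

Unlike standard k-fold splits, early observations are never used as test data here, so the effective number of test points is smaller.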

__If we want to optimize over different learning hyperparameters, we need to divide the data into training, validation and test set__. Hyperparameters are chosen by the data scientist in supervised learning to control model complexity, the definition of complexity, the optimization algorithm or the model type. The training data fit a prediction function based on a specific set of hyperparameters. The validation data is used for tuning the model’s hyperparameters. And the test data set is used for evaluating the algorithm including the tuning process.
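A minimal sketch of the three-way split follows, using a one-feature ridge regression without intercept as the model and its penalty `lam` as the only hyperparameter. The chronological split, the candidate penalties, the data, and the choice to refit on training plus validation data before testing are all assumptions made for illustration.

```python
def fit_ridge_1d(data, lam):
    """One-feature ridge regression without intercept: f(x) = w * x."""
    w = sum(x * y for x, y in data) / (sum(x * x for x, _ in data) + lam)
    return lambda x: w * x


def mse(f, data):
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def tune_and_test(data, lambdas, n_train, n_val):
    """Pick lam on the validation set, then score the tuned procedure on the test set."""
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    # Training data fit the function; validation data choose the hyperparameter.
    best_lam = min(lambdas, key=lambda lam: mse(fit_ridge_1d(train, lam), val))
    # Design choice of this sketch: refit on train + validation with the chosen lam.
    final = fit_ridge_1d(train + val, best_lam)
    # Test data are touched once, after all choices have been made.
    return best_lam, mse(final, test)


# Hypothetical series in time order, for illustration only.
data = [(float(t), 0.5 * t + (0.3 if t % 2 else -0.3)) for t in range(1, 13)]
lambdas = [0.0, 1.0, 10.0, 100.0]

best_lam, test_loss = tune_and_test(data, lambdas, n_train=6, n_val=3)
print(best_lam, test_loss)
```

The test loss evaluates the whole procedure, including the tuning step, which is why the test observations play no role in selecting `best_lam`.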

### The overfitting problem

A major practical pitfall in statistical learning is that features (such as predictors of asset returns) are relatively cheap to produce these days. Hence, quantitative researchers have a proclivity for overfitting (view post here). That proclivity increases with the neglect of structural information and expert knowledge. Overfitting translates into __large gaps between the training and the test performances of models__. Therefore, it is often appropriate to limit model complexity, based on qualified prior judgment, available data, and out-of-sample forecasting results.

Macro and finance is a field with many correlated data series and – when it comes to key macro events – quite limited history. This means __we have many candidate predictors and only a limited number of experiences of specific occurrences, such as financial crises or currency devaluations__. Importantly, complexity requires a sufficiently large amount of data. The ratio of parameters to sample size must be reasonable.

The main defense against overfitting is **regularization**. Regularization means __constraining the level of model complexity so that the model performs better at predicting or generalizing__. Regularization produces models that fit data less well in the training sample with the intended benefit of fitting data better out-of-sample. There are two basic types of regularization. The first is to set the maximum complexity as a hyperparameter. That would simply constrain empirical risk minimization. This type is called **Ivanov regularization**. The alternative would be penalized empirical risk minimization. Penalizing the measure of complexity creates a trade-off between loss and complexity. This “soft constraint” is called **Tikhonov regularization**. For many machine learning algorithms, including LASSO and Ridge regression, these two forms of regularization are equivalent.
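The two forms can be written side by side. The symbols are assumptions of this sketch rather than notation from the notes: $\hat{R}_n(f)$ is the empirical risk of a function $f$ on $n$ observations, $\mathcal{F}$ is the hypothesis space, $\Omega(f)$ is the chosen complexity measure, and $r \ge 0$ and $\lambda \ge 0$ are hyperparameters.

```latex
\text{Ivanov:} \quad \min_{f \in \mathcal{F}} \; \hat{R}_n(f) \quad \text{subject to} \quad \Omega(f) \le r
\qquad\qquad
\text{Tikhonov:} \quad \min_{f \in \mathcal{F}} \; \hat{R}_n(f) + \lambda\, \Omega(f)
```

In the Ivanov form the hyperparameter $r$ caps complexity directly, while in the Tikhonov form $\lambda$ sets the price of complexity relative to loss.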

For each constraint parameter, there will be a different result. The set of these results is called the regularization path. The point is to find the position on this path that balances underfitting and overfitting.

Lasso and ridge regression are the major workhorses of modern data science. They use two types of regularization with slightly different properties.

**Ridge regression** is a regression that adds the “squared magnitude” of the coefficients as a penalty term to the loss function. This means that it is based on L2 regularization. __Coefficients are generally reduced vis-a-vis unconstrained regression, but regressors are not dropped altogether__.

**Lasso** (Least Absolute Shrinkage and Selection Operator) regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function. This means that it is based on L1 regularization. __Lasso often gives__ **sparse solutions**, i.e. a subset of coefficients will have zero values and the dimension of the input vector will be reduced. This helps to make models more interpretable.
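The contrast between shrinking and dropping coefficients is easiest to see in a special case. The sketch below assumes orthonormal regressors, so that the unpenalized least-squares coefficients `z` can be adjusted one by one. It also assumes the objectives `0.5 * squared error + 0.5 * lam * sum(w**2)` for ridge and `0.5 * squared error + lam * sum(|w|)` for lasso. Under those assumptions ridge rescales each coefficient and lasso applies soft-thresholding. The numbers are made up.

```python
import math


def ridge_coefficients(z, lam):
    """Ridge under orthonormal regressors: proportional shrinkage, never exactly zero."""
    return [zj / (1.0 + lam) for zj in z]


def lasso_coefficients(z, lam):
    """Lasso under orthonormal regressors: soft-thresholding, small coefficients become zero."""
    return [math.copysign(max(abs(zj) - lam, 0.0), zj) for zj in z]


# Hypothetical unpenalized least-squares coefficients.
z = [2.0, -0.3, 0.1, -1.5]
lam = 0.5

ridge = ridge_coefficients(z, lam)
lasso = lasso_coefficients(z, lam)

print("ridge:", ridge)
print("lasso:", lasso)
print("regressors kept by lasso:", sum(1 for w in lasso if w != 0.0))
```

With general, correlated regressors there is no such closed form for lasso and an iterative solver is needed, but the qualitative behaviour is the same.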

### Regularization issues

The application of regularization requires some in-depth understanding of the chosen method and knowledge of the data used. __Most problems arise from the use of inputs that have similar or even identical information content__. In Ridge or Lasso regression, adding many time series with the same information content biases predictions towards that type of information. Using time series with different scale and the same information content makes regularization methods prefer the features with a large scale, as they incur less of a penalty in terms of coefficient size. That is why features should usually be standardized.
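Standardization removes the scale dependence described above. The sketch assumes a single feature expressed once in percent and once in basis points (made-up numbers). In practice the mean and standard deviation would typically be estimated on the training sample only, which this sketch does not show.

```python
import math
from statistics import mean, pstdev


def standardize(column):
    """Rescale a feature to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        raise ValueError("a constant feature cannot be standardized")
    return [(v - mu) / sigma for v in column]


# The same hypothetical information in two different units.
spread_pct = [1.0, 2.0, 4.0]
spread_bp = [100.0, 200.0, 400.0]

z_pct = standardize(spread_pct)
z_bp = standardize(spread_bp)

same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(z_pct, z_bp))
print(z_pct)
print("identical after standardization:", same)
```

After standardization both versions of the feature would need the same coefficient size, so the penalty no longer favours one over the other.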

**Duplicate features** refer to added features (predictor candidates) that do not give new information. __The regularization type affects how weights are split between duplicates.__ For example, L2 will typically split weight equally between duplicate features (adjusted for scale) as it “dislikes” large values for any individual feature. L1 typically yields a range of equivalent solutions and merely ensures that features with equal information have the same coefficient sign.
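Both statements can be checked on a toy case with two identical features. The sketch below is an illustration under assumptions: the ridge solver handles exactly two features without intercept, with the penalty `lam` added to the diagonal of the normal equations, and the data are made up.

```python
def ridge_two_features(x1, x2, y, lam):
    """Solve the 2x2 ridge normal equations (X'X + lam * I) w = X'y by Cramer's rule."""
    a = sum(v * v for v in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    c1 = sum(u * v for u, v in zip(x1, y))
    c2 = sum(u * v for u, v in zip(x2, y))
    det = a * d - b * b  # strictly positive whenever lam > 0
    return (c1 * d - b * c2) / det, (a * c2 - b * c1) / det


def l1_penalty(w):
    return sum(abs(v) for v in w)


def l2_penalty(w):
    return sum(v * v for v in w)


# Hypothetical data: the second feature is an exact copy of the first.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w1, w2 = ridge_two_features(x, x, y, lam=1.0)
print("ridge weights on duplicates:", w1, w2)

# Three ways to split a total weight of 1 between the duplicates.
# All give identical predictions, so only the penalty distinguishes them.
candidate_splits = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
for w in candidate_splits:
    print(w, "L1 penalty:", l1_penalty(w), "L2 penalty:", l2_penalty(w))
```

The L2 penalty is smallest for the equal split, which is why ridge divides the weight evenly, while the L1 penalty is the same for every same-sign split, which is why lasso is indifferent between them.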

**Correlated features** with the same scales are quite common in financial market models. The higher the correlation, the closer we get to duplicate features. For L1 regularization this means that __minor perturbations (in data) can drastically change optimal coefficients. Solutions would be very unstable__. The division of weight among highly correlated features (of the same scale) will look quite arbitrary. Elastic net combines lasso and ridge penalties (L1 and L2 regularization) and mitigates the instability issue.
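One common way to write the elastic net objective for a linear prediction function is shown below. The notation is an assumption of this sketch: $w$ is the coefficient vector, $(x_i, y_i)$ are the $n$ training pairs, and $\lambda_1, \lambda_2 \ge 0$ are the two penalty hyperparameters. Other parameterizations exist.

```latex
\hat{w} = \arg\min_{w} \; \frac{1}{n}\sum_{i=1}^{n}\big(w^{\top}x_i - y_i\big)^2
          + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2
```

Setting $\lambda_2 = 0$ gives lasso and setting $\lambda_1 = 0$ gives ridge. With both positive, the L2 term makes the solution unique and spreads weight across highly correlated features, while the L1 term can still set coefficients to zero.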