With R and Python one can apply a wide range of methods for predicting financial market variables. Key concepts include penalized regression, such as Ridge and LASSO, support vector regression, neural networks, standard regression trees, bagging, random forest, and gradient boosting. The latter three are ensemble methods, i.e. machine learning techniques that combine several base models to produce one optimal prediction. According to a new paper, these ensemble methods scored a decisive win in the nowcasting and out-of-sample prediction of credit spreads. One apparent reason is the importance of non-linear relations in times of high volatility.

Below are excerpts from the paper. Headings and texts in italics and brackets have been added. The explanatory videos on the main statistical methods used come from the YouTube channels of Josh Starmer, Simplilearn and Udacity.

The post ties in with this site’s summary on “quantitative methods to macro information efficiency”.

### Brief overview of competing statistical prediction methods

#### Ridge and LASSO regression

“In variable selection and regularization, Ridge and LASSO regressions are two commonly used methods. They were developed specifically to address the problem of collinearity in datasets with many variables. They are based on standard linear regression plus a regularization term that reduces the model variance. Both Ridge and LASSO regression use all of the variables in the dataset and adjust the coefficient estimates of non-significant variables to “shrink” toward zero. The main difference between the two methods is that Ridge keeps all variables, whereas LASSO allows the penalty to force some parameters to equal exactly zero. Thus, LASSO has variable selection features and produces a reduced model.”

Notes:

- Regularization is a way to avoid overfitting by penalizing high-valued regression coefficients. In simple terms, it shrinks parameters and thereby simplifies the model.
- Ridge regression belongs to a class of regression tools that use L2 regularization. The L2 penalty equals the square of the magnitude of the coefficients. All coefficients are shrunk by the same factor, so none are eliminated.
- The acronym “LASSO” stands for Least Absolute Shrinkage and Selection Operator. LASSO regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients. This type of regularization can result in sparse models with few non-zero coefficients.
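The contrast between L2 and L1 shrinkage can be shown in a minimal Python sketch with scikit-learn, on simulated data containing a collinear and an irrelevant regressor. The data and the penalty strengths below are illustrative assumptions, not taken from the paper:

```python
# Sketch: contrast L2 (Ridge) and L1 (LASSO) shrinkage on toy data
# with a nearly collinear regressor pair and one irrelevant regressor.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)               # irrelevant regressor
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: can zero some out

print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)          # all coefficients non-zero, but shrunk
print("LASSO:", lasso.coef_)          # sparse: some coefficients exactly zero
```

The qualitative pattern matches the notes above: Ridge keeps all three regressors with damped coefficients, while LASSO sets at least the irrelevant one exactly to zero.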

#### Support Vector Regression (SVR)

“Support Vector Regression (SVR) is the regression version of the Support Vector Machine classifier (SVM)…In SVM, a hyperplane is used to divide the p-dimensional feature space into two halves. A good separation is achieved when the hyperplane has the largest distance to the nearest training data point of any class. SVR instead tries to find a hyperplane that minimizes the distance of all data points to it. The task of the SVR is to cover as many sample points as possible with a fixed-width stripe so that the total error is as small as possible.”
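The “fixed-width stripe” corresponds to the epsilon-insensitive tube in SVR: points inside the tube incur no loss. A minimal sketch with scikit-learn, on illustrative toy data and parameter values:

```python
# Sketch: epsilon-insensitive Support Vector Regression on toy data.
# Points inside the tube of width 2*epsilon around the fitted function
# incur no loss; only points on or outside it become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
pred = svr.predict(X)

# Only a subset of observations ends up as support vectors
print("support vectors:", len(svr.support_), "of", len(X))
```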

#### Neural Network

“A neural network is a computational model consisting of a large number of nodes (or neurons) connected to each other. A classic neural network model has at least two layers, an input layer and an output layer, while intermediate hidden layers capture the complexity of the system. Nodes are arranged in layers, whereby each node represents a specific output function called an activation function. The connection between any two nodes carries a weighted value for the signal passing through it, called the weight, which is equivalent to the memory of the artificial neural network.
A neural network has many hyper-parameters: the number of layers, the number of nodes on each layer, the drop rate of layers and so on. Just like for other methods, we use cross-validation to tune the hyper-parameters and obtain good enough predictive accuracy.”
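The cross-validated tuning of hyper-parameters described above can be sketched with scikit-learn's `MLPRegressor` and a small grid search. The layer widths, penalty values and data are illustrative assumptions, not the paper's specification:

```python
# Sketch: a small feed-forward network for regression, with a toy
# cross-validated search over two hyper-parameters.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=300)

grid = {"hidden_layer_sizes": [(16,), (32, 16)], "alpha": [1e-4, 1e-2]}
search = GridSearchCV(
    MLPRegressor(max_iter=1000, random_state=0), grid, cv=3
).fit(X, y)
print("best hyper-parameters:", search.best_params_)
```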

#### Regression Tree

“Regression tree is the regression version of the decision tree. The tree method seeks to split the data recursively into subsets so as to find linear solutions within each subset that improve the overall fit. By dividing data into homogeneous subsets to minimize the overall standard deviation, this method uses a top-down approach to choose the best attribute for dividing the space. The basic idea is to construct a tree with the fastest decline in entropy value based on information entropy, whereby the entropy value at the leaf nodes is zero.”

N.B.: Entropy is a measure of uncertainty of a random system. Uniform distributions have maximum uncertainty. Distributions with small standard deviation and no outliers have low uncertainty.
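The splitting logic can be sketched with scikit-learn's `DecisionTreeRegressor`: each split is chosen to make the child subsets as homogeneous as possible (variance reduction, the regression analogue of the entropy criterion). Data and depth below are illustrative:

```python
# Sketch: a depth-limited regression tree on data with a single
# structural break; the root split should recover the break point.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X.ravel() < 5, 1.0, 3.0) + 0.1 * rng.normal(size=200)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
# The first split should land near the true break point at x = 5
print("root split threshold:", tree.tree_.threshold[0])
```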

#### Bagging

“Bagging is an abbreviation of bootstrap aggregating. It is a method of sampling with replacement, possibly with duplicate samples. Bagging starts by extracting training sets from the original sample set. Each round draws n training observations from the original sample set using bootstrapping. A total of k rounds of extraction are performed, resulting in k independent training sets. Each training set is used to fit one model, so k training sets yield k models. For the regression problem, the mean value of these models’ predictions is taken as the final result, with all models having the same importance.”
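The k-round procedure can be written out by hand in a few lines. A minimal sketch, using regression trees as the base model (the choice of k, tree depth and data is illustrative):

```python
# Sketch of bagging by hand: k bootstrap resamples of the training set,
# one regression tree per resample, predictions averaged with equal weight.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.2 * rng.normal(size=200)

k, n = 25, len(X)
preds = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)   # draw n rows with replacement
    model = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
    preds.append(model.predict(X))
bagged = np.mean(preds, axis=0)        # equal-weight average of k models

print("bagged predictions:", bagged[:5])
```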

#### Random Forest

“Random forest…is a classifier/regression model containing multiple decision trees, and is built to deal with the overfitting problem of decision and regression trees. The tree method has good in-sample performance but relatively bad out-of-sample performance. Random forests help solve this problem by combining the concept of bagging with random feature selection. Random forest further conducts random feature selection on the subsamples generated from the original dataset, and estimates a regression tree on each subsample. When forecasting, each tree predicts a result, and all the results are weighted to avoid overfitting.”
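In scikit-learn terms, the combination of bagging and random feature selection corresponds to `RandomForestRegressor` with a restricted `max_features`. A minimal sketch on simulated data (all parameter values are illustrative):

```python
# Sketch: random forest = bagging + random feature selection per split
# (max_features below restricts each split to a random feature subset).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=300)  # only features 0, 1 matter

forest = RandomForestRegressor(
    n_estimators=100, max_features=2, random_state=0  # 2 of 6 features per split
).fit(X, y)

# Averaging over de-correlated trees; importances hint at the true drivers
print("feature importances:", np.round(forest.feature_importances_, 2))
```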

#### Boosting and Gradient Boosting

“The Boosting algorithm optimizes the regression results through a series of iterations. The idea behind boosting is to combine the outputs of many models to produce a powerful overall voting committee.
AdaBoost is an abbreviation of “Adaptive Boosting”…In the Adaboost process the first model is trained on the data where all observations receive equal weights. Those observations misclassified by the first weak model will receive a higher weight, while correct observations have a lower weight. The newly added second model will thus focus more on the error of the first model. Such iteration keeps adding weak models until the desired low error rate is achieved.
Gradient Boosting is the generalized version of AdaBoost. Gradient Boosting selects the direction of the gradient descent during each iteration to ensure that the final result is the best possible. The loss function is used to describe the degree of fit of the model. It is assumed that the model is not overfitted. The greater the loss function, the higher the error rate of the model. If our model can make the loss function continue to decline, then our model is constantly improving, and the best way is to let the loss function descend in the direction of its gradient.”
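The steady decline of the loss across boosting iterations can be seen directly in scikit-learn's `GradientBoostingRegressor`, which records the in-sample loss at each stage. Data and hyper-parameters below are illustrative:

```python
# Sketch: gradient boosting fits each new shallow tree to the negative
# gradient of the loss (for squared error, the residuals of the current
# ensemble), so the training loss declines across iterations.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

gbr = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.05, max_depth=2, random_state=0
).fit(X, y)

# train_score_ holds the in-sample loss at each boosting stage
print("loss after 10 trees: ", gbr.train_score_[9])
print("loss after 200 trees:", gbr.train_score_[199])
```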

### CDS basics

“The Credit Default Swap (CDS) market…represents the third largest over-the-counter (OTC) derivatives market, with a gross market value of about \$8 trillion (BIS, 2019).”

“CDS enables market participants to shift the default risk on a firm from an insurance buyer to an insurance seller. The buyer pays a premium to secure potential future protection. Hence the premium and the protection legs together determine the CDS spread. The premium leg represents the expected present value of the premium payments from the insurance buyer to the seller, while the protection leg represents the expected present value of the default loss payment from the seller to the buyer. A fairly priced CDS equates the premium leg and the protection leg.”
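Equating the two legs can be illustrated with a toy pricing sketch under strong simplifying assumptions: a flat hazard rate, a flat discount rate, quarterly premiums and no accrual on default. All inputs are illustrative, not from the paper:

```python
# Sketch: fair CDS spread from equating premium and protection legs,
# assuming a constant hazard rate and flat continuously compounded rate.
import math

hazard = 0.02      # constant annual default intensity (assumption)
recovery = 0.40    # assumed recovery rate
r = 0.03           # flat risk-free rate, continuous compounding
maturity = 5.0     # 5-year contract with quarterly payments
dt = 0.25

premium_leg = 0.0    # PV of 1 unit of spread paid while the firm survives
protection_leg = 0.0 # PV of the default loss payment
t = dt
while t <= maturity + 1e-9:
    disc = math.exp(-r * t)
    surv_now = math.exp(-hazard * t)
    surv_prev = math.exp(-hazard * (t - dt))
    premium_leg += dt * disc * surv_now
    protection_leg += (1 - recovery) * disc * (surv_prev - surv_now)
    t += dt

fair_spread = protection_leg / premium_leg  # spread that equates the legs
print(f"fair spread: {1e4 * fair_spread:.1f} bp")
```

Under these assumptions the result sits close to the "credit triangle" approximation, spread ≈ (1 − recovery) × hazard, i.e. about 120 basis points here.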

“By providing insurance against default, CDS enables loan lenders to hedge the default risk of borrowers, where the CDS spread depends on direct information about the creditworthiness of the entity named on the derivative security. After the 2008 financial crisis, CDS spreads have become the most closely monitored early warning signals for credit risk changes. The risk-neutral implied default probabilities estimated from CDS spreads are used to price credit securities, assess credit quality by rating firms, monitor systemic risk, and stress test financial systems by regulators.”

“Unlike the rare credit events, the CDS market offers timely cross-sectional and time-series credit information, gauged by the market…CDS spreads are less affected by liquidity and tax effects compared to bond spreads.”

### Empirical findings

“We compared the predictive performance of a series of machine learning and traditional methods for monthly CDS spreads, using firms’ accounting-based, market-based and macroeconomic variables for a time period of 2006 to 2016.”

“Our sample is based on the CDS constituents in the CDX North American Investment Grade Index, which includes the most liquid North American entities’ CDSs with investment-grade credit ratings…We collect the 5-year CDS spreads of the constituents at the end of each month over the period 2006 to 2016… We collect firm-level accounting-based and market-based variables, analyst forecasts, financial markets, and macro-economic variables…[After adjustments for missing data] 69 entities remain in our sample with 6811 corresponding monthly CDS spreads.”

“We focus on the out-of-sample predictive power of the accounting-based and market-based variables on CDS spreads using linear regression and machine learning methods, motivated by a reduced-form forward intensity model. To compare these methods fairly, all of the models are estimated using the same set of input variables within the same dataset. To test the out-of-sample predictive performance, we divide the original dataset into an in-sample training set and an out-of-sample test set. The methods are estimated on the in-sample set to determine their respective parameters and evaluated on the out-of-sample set. We…evaluate the predictive performance using root-mean-square error (RMSE)…A smaller RMSE indicates better predictive performance.”
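The evaluation design (train/test split, common inputs, test-set RMSE) can be sketched in a few lines. The data and the two competing models below are illustrative stand-ins for those in the paper, with a non-linear term included to mimic the conditions under which ensembles shine:

```python
# Sketch of the evaluation design: fit competing models on a training
# set and compare root-mean-square error (RMSE) on a held-out test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 5))
y = X[:, 0] * X[:, 1] + X[:, 2] + 0.2 * rng.normal(size=500)  # non-linear term

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def rmse(model):
    """Fit on the training set, return test-set RMSE."""
    pred = model.fit(X_tr, y_tr).predict(X_te)
    return np.sqrt(np.mean((pred - y_te) ** 2))

print("linear regression RMSE:", round(rmse(LinearRegression()), 3))
print("random forest RMSE:   ", round(rmse(RandomForestRegressor(random_state=0)), 3))
```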

“Our results indicate that machine learning methods can considerably enhance the prediction accuracy of CDS spreads both cross-sectionally and over time when compared to traditional econometric models quantifying credit risk relationships. Ensemble methods, including Bagging, Random Forest, and Gradient Boosting, consistently outperform basic interpretable methods, such as Ridge, LASSO, and linear regression, in prediction accuracy and stability. The precision of linear regression fluctuates widely across randomly chosen estimation and test sets and leads to the weakest average out-of-sample prediction power.” “Ensemble machine learning models, including Random Forest, Bagging, and Gradient Boosting, have outperformed all other methods, both in cross-sectional and longitudinal samples.”

“We further assess the importance of regressors by using the LIME (Local Interpretable Model-Agnostic Explanations) method, to provide more thorough insights into the underlying reasons why ensemble MLs are more accurate in predicting CDS spreads…The results…suggest that ensemble ML methods can identify authentic credit information for predicting CDS spreads.”

“In times of higher volatility and potential structural breaks, prediction accuracy seems particularly driven by non-linear firm-specific credit risk and broader economic conditions, which are not properly captured by traditional estimation procedures such as OLS.”

“Inflexible methods, including Linear and Ridge regression, can in some cases forecast comparably well to ensemble machine learning methods, but can also perform quite poorly in other cases, [and] have more outlier RMSEs and a wider range of RMSEs…Linear regression is the most unstable method.”