Supervised machine learning enhances the econometric toolbox with methods that find functional forms of prediction models in a manner that optimizes out-of-sample forecasting. It mainly serves prediction, whereas classical econometrics mainly estimates specific structural parameters of the economy. Machine learning emphasizes past patterns in data rather than top-down theoretical priors. The prediction function is typically found in two stages: [1] picking the “best” form conditional on a given level of complexity and [2] picking the “best” complexity based on past out-of-sample forecast performance. This method is attractive for financial forecasting, where returns depend on many complex relations, most of which are not well understood even by professionals, and where backtesting of strategies should be free of theoretical bias that arises from historical experience.

Mullainathan, Sendhil and Jann Spiess, “Machine Learning: An Applied Econometric Approach”, Journal of Economic Perspectives, Volume 31, Number 2, Spring 2017, Pages 87–106.

The post ties in with SRSV’s lecture on information efficiency, particularly the section on “support from data science”.

Below are excerpts from the paper. Emphasis and cursive text have been added. Formulas and symbols have been paraphrased for easier reading.

The role of machine learning in econometrics

“We present a way of thinking about machine learning that gives it its own place in the econometric toolbox…Machine learning not only provides new tools, it solves a different problem. Machine learning (or rather “supervised” machine learning, the focus of this article) revolves around the problem of prediction: produce predictions of y from x. The appeal of machine learning is that it manages to uncover generalizable patterns. In fact, the success of machine learning at intelligence tasks is largely due to its ability to discover complex structure that was not specified in advance. It manages to fit complex and very flexible functional forms to the data without simply overfitting; it finds functions that work well out-of-sample.”

“Theory- and data-driven modes of analysis have always coexisted. Many estimation approaches have been (often by necessity) based on top-down, theory-driven, deductive reasoning. At the same time, other approaches have aimed to simply let the data speak. Machine learning provides a powerful tool to hear, more clearly than ever, what the data have to say.”

“Many economic applications, instead, revolve around parameter estimation: produce good estimates of parameters that underlie the relationship between dependent and explanatory variables…Put succinctly, machine learning belongs in the part of the toolbox marked prediction rather than in the more familiar parameter estimation compartment… Applying machine learning to economics requires finding relevant prediction tasks. One category of such applications appears when using new kinds of data for traditional questions; for example, in measuring economic activity using satellite images.”

“Machine learning algorithms are now technically easy to use: you can download convenient packages in R or Python that can fit decision trees, random forests, or LASSO (Least Absolute Shrinkage and Selection Operator) regression coefficients. This also raises the risk that they are applied naively or their output is misinterpreted.”

On statistical methods for selecting macro trading factors, view post here.

How machine learning works

“Supervised machine learning algorithms seek functions that predict well out of sample. For example, we might look to predict the value of a house from its observed characteristics based on a sample of houses. The algorithm would take a loss function [where loss quantifies the prediction error for the price based on the characteristics] as an input and search for a function that has low expected prediction loss on a new data point from the same distribution.”

“Familiar estimation procedures, such as ordinary least squares, already provide convenient ways to form predictions, so why look to machine learning to solve this problem?…Applying ordinary least squares to this problem requires making some choices [for example of what explanatory variables to include and whether to consider interactions between these variables]…Machine learning searches for…interactions automatically. Consider, for example, a typical machine learning function class: regression trees. Like a linear function, a regression tree maps each vector of…characteristics to a predicted value. The prediction function takes the form of a tree that splits in two at every node. At each node of the tree, the value of a single variable determines whether the left or the right child node is considered next. When a terminal node—a leaf—is reached, a prediction is returned.”
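
The tree traversal described in the excerpt can be sketched in a few lines of Python. This is a minimal illustration and not code from the paper: the `Node` class, the `predict` function, the two house characteristics and all thresholds and prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One node of a binary regression tree (hypothetical structure)."""
    feature: Optional[int] = None       # index of the variable tested at this node
    threshold: Optional[float] = None   # go left if x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prediction: Optional[float] = None  # set only on leaves


def predict(node: Node, x: Sequence[float]) -> float:
    """Walk from the root to a leaf and return the leaf's prediction."""
    while node.prediction is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Made-up tree for house prices: x = (living area in square meters, bedrooms).
tree = Node(
    feature=0, threshold=100.0,
    left=Node(prediction=150_000.0),
    right=Node(
        feature=1, threshold=3.0,
        left=Node(prediction=250_000.0),
        right=Node(prediction=400_000.0),
    ),
)

small_house = predict(tree, (80.0, 2))
large_house = predict(tree, (160.0, 5))
```

In this made-up tree the number of bedrooms only matters for houses above 100 square meters. That is an interaction between two variables which a linear regression would capture only if the analyst specified it in advance.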

“The very appeal of machine learning is high dimensionality: flexible functional forms allow us to fit varied structures of the data. But this flexibility also gives so many possibilities that simply picking the function that fits best in-sample will be a terrible choice. So how does machine learning manage to do out-of-sample prediction?

  • The first part of the solution is regularization. In the tree case, instead of choosing the “best” overall tree, we could choose the best tree among those of a certain depth… The shallower the tree, the worse the in-sample fit… But this also means there will be less overfit: the idiosyncratic noise of each observation is averaged out…By choosing the level of regularization appropriately, we can have some benefits of flexible functional forms without having those benefits be overwhelmed by overfit.
  • The second key insight: empirical tuning…we create an out-of-sample experiment inside the original sample. We fit on one part of the data and ask which level of regularization leads to the best performance on the other part of the data.”
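
The two ideas in the list above, regularization by tree depth and empirical tuning on held-out data, can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the data are simulated, there is a single explanatory variable, the tree is grown greedily (so it is not guaranteed to be the best tree of a given depth), and one train/holdout split stands in for more careful schemes such as cross-validation. All function names are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """In-sample quadratic loss of predicting the mean for every value."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(points, depth):
    """Greedy 1-D regression tree with at most `depth` levels of splits.

    `points` is a list of (x, y) pairs sorted by x. A leaf is a float,
    an inner node is a tuple (threshold, left_subtree, right_subtree).
    """
    ys = [y for _, y in points]
    if depth == 0 or len(points) < 2:
        return mean(ys)
    best = None
    for k in range(1, len(points)):
        if points[k - 1][0] == points[k][0]:
            continue  # cannot split between identical x values
        loss = sse(ys[:k]) + sse(ys[k:])
        if best is None or loss < best[0]:
            best = (loss, k)
    if best is None:
        return mean(ys)
    k = best[1]
    threshold = (points[k - 1][0] + points[k][0]) / 2
    return (threshold, fit_tree(points[:k], depth - 1), fit_tree(points[k:], depth - 1))


def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def mse(tree, points):
    return mean([(y - predict(tree, x)) ** 2 for x, y in points])


# Simulated data (an assumption for illustration): a smooth signal plus noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)))

train = sorted(data[:120])  # used to fit the tree at each depth
holdout = data[120:]        # used only to compare depths

depths = list(range(0, 9))
train_mse, holdout_mse = [], []
for d in depths:
    tree = fit_tree(train, d)  # step 1: fit given a level of complexity
    train_mse.append(mse(tree, train))
    holdout_mse.append(mse(tree, holdout))

# Step 2: pick the complexity with the lowest loss on the held-out part.
best_depth = min(depths, key=lambda d: holdout_mse[d])
```

Printing `train_mse` next to `holdout_mse` shows the pattern the excerpt describes: the in-sample loss can only fall as the tree gets deeper, while the held-out loss is what identifies a useful depth.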

“This structure—regularization and empirical choice of tuning parameters—helps organize the…variety of prediction algorithms that one encounters. There is a function class [such as a regression tree] and a regularizer [such as the depth of the tree] that expresses the complexity of a function…Picking the prediction function then involves two steps: The first step is, conditional on a level of complexity, to pick the best in-sample loss-minimizing function. The second step is to estimate the optimal level of complexity using empirical tuning. In [the table below]…we give an incomplete overview of methods that follow this pattern.”

“For example, in our framework, the LASSO (probably the machine learning tool most familiar to economists) corresponds to [1] a quadratic loss function, [2] a class of linear functions (over some fixed set of possible variables), and [3] a regularizer which is the sum of absolute values of coefficients. This effectively results in a linear regression in which only a small number of predictors from all possible variables are chosen to have nonzero values: the absolute-value regularizer encourages a coefficient vector where many are exactly zero.”
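
To see why the absolute-value regularizer produces coefficients that are exactly zero, a small LASSO can be written with coordinate descent and soft-thresholding. This is a sketch under assumptions, not the paper's code: the data are simulated, there is no intercept, the penalty level `lam` is fixed by hand rather than tuned, and the function names are hypothetical.

```python
import random


def soft_threshold(a, lam):
    """Shrink a toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


def lasso(X, y, lam, sweeps=100):
    """Minimise (1/2n) * sum of squared errors + lam * sum(|b_j|) by coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # y - X b, with b = 0 at the start
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            # Covariance of feature j with the residual that leaves feature j out.
            rho = sum(v * (r + v * b[j]) for v, r in zip(col, resid)) / n
            new_bj = soft_threshold(rho, lam) / z
            resid = [r - v * (new_bj - b[j]) for v, r in zip(col, resid)]
            b[j] = new_bj
    return b


# Simulated data (an assumption): only the first two of six predictors matter.
rng = random.Random(0)
n, p = 200, 6
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]

b_lasso = lasso(X, y, lam=0.2)
b_ols = lasso(X, y, lam=0.0)  # no penalty: the same loop converges to least squares

selected = [j for j, coef in enumerate(b_lasso) if coef != 0.0]
```

Comparing `b_lasso` with `b_ols` shows both effects of the penalty: the irrelevant predictors drop out entirely, and the coefficients that survive are shrunk toward zero relative to the unpenalized fit.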

“On out-of-sample performance, machine learning algorithms such as random forests can do significantly better than ordinary least squares, even at moderate sample sizes and with a limited number of covariates.”

Drawbacks of machine learning

“The very appeal of [machine learning] algorithms is that they can fit many different functions. But this creates an Achilles’ heel: more functions mean a greater chance that two functions with very different coefficients can produce similar prediction quality. As a result, how an algorithm chooses between two very different functions can effectively come down to the flip of a coin.”
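
The coin-flip character of model choice can be illustrated with two nearly identical predictors. The sketch below is a simplified stand-in for the point in the excerpt, not a replication: the data are simulated, fresh samples are drawn instead of re-partitioning one data set, and the "algorithm" only chooses between two one-variable regressions. All names are hypothetical.

```python
import random


def fit_single(xs, ys):
    """Least-squares slope (no intercept) and mean squared error for one predictor."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    mse = sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(ys)
    return slope, mse


def draw_sample(rng, n=200):
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [a + rng.gauss(0, 0.1) for a in x1]  # almost a copy of x1
    y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return x1, x2, y


rng = random.Random(1)
picks = {"x1": 0, "x2": 0}
gaps = []
for _ in range(200):
    x1, x2, y = draw_sample(rng)
    _, mse1 = fit_single(x1, y)
    _, mse2 = fit_single(x2, y)
    # Choose whichever one-variable model fits better in this sample.
    picks["x1" if mse1 <= mse2 else "x2"] += 1
    gaps.append(abs(mse1 - mse2) / min(mse1, mse2))

largest_gap = max(gaps)  # largest relative difference in fit across samples
```

The two candidate models put all weight on different variables, yet their fit differs only marginally in every sample. Which variable is "selected" therefore says little about which one matters.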

“Regularization also contributes to the problem. First, it encourages the choice of less complex, but wrong models…Second, it can bring with it a cousin of omitted variable bias, where we are typically concerned with correlations between observed variables and unobserved ones. Here, when regularization excludes some variables, even a correlation between observed variables and other observed (but excluded) ones can create bias in the estimated coefficients.”
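
The "cousin of omitted variable bias" can be made concrete with a small simulation. This is an illustrative sketch, not taken from the paper: the data-generating process is made up, and the exclusion of the second variable is imposed by hand to mimic what a sparse regularizer might do.

```python
import random

rng = random.Random(2)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0, 0.6) for a in x1]         # observed, correlated with x1
y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]  # true coefficients: 1 and 1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Full model: solve the 2x2 normal equations for y = b1*x1 + b2*x2 (no intercept).
s11, s12, s22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
s1y, s2y = dot(x1, y), dot(x2, y)
det = s11 * s22 - s12 * s12
b1_full = (s22 * s1y - s12 * s2y) / det
b2_full = (s11 * s2y - s12 * s1y) / det

# Restricted model: x2 has been dropped, so x1 also picks up the effect of x2.
b1_restricted = s1y / s11
```

With these settings the restricted coefficient on `x1` should come out near 1.8 rather than 1, because `x1` absorbs the effect of the excluded but correlated `x2`. The restricted model may still predict acceptably, which is why a good forecast does not license reading its coefficients as structural estimates.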