Statistical learning for macro trading involves model training, model validation and learning method testing. A simple workflow [1] determines form and parameters of trading models, [2] chooses the best of these models based on past out-of-sample performance, and [3] assesses the value of the deployed learning method based on further out-of-sample results. A convenient technology is the ‘list-column workflow’ based on the tidyverse packages in R. It stores all related objects in a single data table, including models and nested data sets, and implements statistical learning through functional programming on that table. Key steps are [1] the creation of point-in-time data sets that represent information available at a particular date in the past, [2] the estimation of different model types based on initial training sets prior to each point in time, [3] the evaluation of these different model types based on subsequent validation data just before each point in time, and [4] the testing of the overall learning method based on testing data at each point in time.

The below is based on personal experience of the author and the following references:
“R for Data Science” by Garrett Grolemund and Hadley Wickham,
“Machine Learning in the Tidyverse”, DataCamp interactive course by Dmitriy Gorenshteyn,
“Apply functions with purrr: cheat sheet”,
“Keep It Together – Using the Tidyverse for Machine Learning” by Jared Wilber.

The post ties in with this site’s summary on “quantitative methods for macro information efficiency”, particularly the section on statistical learning.

Why macro trading needs statistical learning

Statistical learning refers to a set of tools for modelling and understanding complex datasets. A statistical learning workflow for macro trading strategies is a sequence of calculations based on past data that produces meaningful information for discretionary and systematic trading decisions. The term ‘sequence’ indicates that the chronological order of calculation is critical. If the workflow is highly automated it becomes a machine learning workflow.

Unlike other fields of statistical learning, most macro trading strategies are based on limited data sets, particularly if they require actual macroeconomic data on business cycles, inflation trends, economic balance sheets and so forth. Therefore, knowledge of markets, understanding of macroeconomics, and a good grasp of mathematical theory are often more important than data science. However, statistical learning still has important functions, particularly for [1] specifying form and parameters of a trading model, [2] choosing the broad model class for the trading strategy, and [3] checking how well the learning methods deployed in specifying and choosing models would have performed in producing trading value.

The three functions of statistical learning for trading strategies correspond to the classical steps in a statistical learning workflow: model training, model validation and method testing. These three steps must be based on independent data sets. Importantly, a statistical learning workflow helps to adapt the investment process continuously to new information, often more rapidly and smoothly than human intervention.

Basics of the tidyverse list-column workflow for R

The tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. Key underlying methods of the tidyverse for visualization, modelling, transformation, tidying and importing data are explained in the online book “R for Data Science” by Garrett Grolemund and Hadley Wickham.

The list-column workflow is a method for applying machine learning in the tidyverse. It makes it easy to create nested workflows and allows keeping all related objects in a single data frame. The basis for this workflow is the tibble, a type of data frame that is optimized for data science applications. One of the features of a tibble is that its cells not only contain simple objects, such as numbers, logicals or strings, but also lists and complex objects that are stored in list form. These include other data frames and models. Columns that store such list objects are called list columns. If a data frame stores another data frame in its cells it is called a nested data frame.
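
As a minimal sketch of a nested data frame, the toy example below collapses a small panel into one row per cross-section, with the observations of each cross-section stored as a tibble inside a list column. The panel, its column names (`cid`, `x`, `y`) and the object names are made up for illustration and are not part of the original workflow.

```r
library(tibble)
library(tidyr)

# Toy panel (hypothetical): 3 cross-sections with 4 observations each.
set.seed(1)
tbl.panel <- tibble(
  cid = rep(c("EUR", "USD", "JPY"), each = 4),
  x   = rnorm(12),
  y   = rnorm(12)
)

# nest() collapses columns x and y into one tibble per cid,
# stored in a list column named 'data'.
tbl.nested <- nest(tbl.panel, data = c(x, y))

tbl.nested                 # one row per cross-section
is.list(tbl.nested$data)   # the 'data' column is a list column
tbl.nested$data[[1]]       # a complete tibble stored inside a single cell
```

Each cell of `data` can then be treated like any other data frame, which is what makes it possible to keep data sets and models side by side in one table.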

List columns allow working with a multitude of models and datasets inside a single tibble through functional programming. A functional is a function that takes other functions as arguments. The tidyverse prefers functionals of the `purrr` package to apply functions to lists of data sets or lists of models. In a nutshell, the list-column workflow of the tidyverse operates in three steps: [1] structure a tibble with the appropriate list columns, [2] operate on and with the list columns by applying `purrr::map()`, `purrr::map2()` or similar functionals, and [3] simplify some resulting list columns either by using `tidyr::unnest()` or directly by using output-specific map functions, such as `purrr::map_dbl()`.
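
The three steps can be sketched on the built-in `mtcars` data set, which stands in for trading data here purely for illustration. The regression `mpg ~ wt` and all object names are assumptions of this sketch.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Step 1: structure a tibble with a list column (one data frame per cylinder class).
tbl.cyl <- mtcars %>%
  as_tibble() %>%
  nest(data = -cyl)

# Step 2: operate on the list column with functionals.
tbl.fit <- tbl.cyl %>%
  mutate(
    mod = map(data, ~lm(mpg ~ wt, data = .x)),      # list column of fitted models
    # Step 3 (direct route): map_dbl() returns a plain numeric column.
    rsq = map_dbl(mod, ~summary(.x)$r.squared)
  )

# Step 3 (unnest route): build small tibbles per model, then flatten them.
tbl.coef <- tbl.fit %>%
  mutate(coefs = map(mod, ~tibble(term = names(coef(.x)),
                                  estimate = unname(coef(.x))))) %>%
  select(cyl, coefs) %>%
  unnest(coefs)

tbl.fit
tbl.coef
```

The direct route suits one number per row, while the unnest route suits outputs with several rows per model.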

The information stored in model objects in R can be used to perform analysis or subsequent calculations, such as predictions. The `broom` package offers functions that extract key information about statistical objects in tidy tibbles.  Three functions are particularly useful: [1] the `broom::tidy()` function extracts information about the components of a model, such as coefficient estimates and t-statistics, [2] the `broom::glance()` function extracts a tibble with exactly one row of model summaries, typically goodness of fit measures, p-values for hypothesis tests on residuals, or model convergence information, and [3] the `broom::augment()` function accepts a model object and a dataset and returns a data frame with response, explanatory, fitted and residual series and can take new data for predictions.
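
For illustration, the three `broom` functions can be applied to a simple linear model on the built-in `mtcars` data. The model and the `new.obs` data frame are arbitrary choices for this sketch.

```r
library(broom)

mod <- lm(mpg ~ wt + hp, data = mtcars)

tidy(mod)      # one row per term: estimate, std.error, statistic, p.value
glance(mod)    # exactly one row of model-level summaries (e.g. r.squared, AIC)
augment(mod)   # model data plus fitted values (.fitted) and residuals (.resid)

# Predictions for new observations are requested through the 'newdata' argument.
new.obs <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
augment(mod, newdata = new.obs)
```

Because all three functions return tibbles, their output can be stored in list columns and unnested like any other data set.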

The list-column workflow is particularly useful for cross validation, a statistical method that is widely used to assess the skill of machine learning models. Cross validation requires the separation of a data sample into three parts, one for the purpose of model fitting (core training sample), one for model evaluation and selection (validation sample) and one for broad method evaluation (testing sample). The composite of the core training data and validation data will be called broad training sample in this post.

The broad training data are typically used for selecting the most suitable model. For this purpose the broad training data set must be divided into a core training set and a validation set. The core training set is used to fit various models. The validation set is used to evaluate the fitted models ‘out of sample’. Validation with a distinct set allows an assessment of how models performed from the perspective of various statistics, without being contaminated by the core training sample or contaminating the final testing sample.

In a final step, the chosen ‘best’ model can then be evaluated based on the hitherto untouched test data. Here testing is an evaluation of methodology (of specifying and selecting models), rather than of a specific model. This final step has two stages. First, the best model is estimated based on the full broad training data. Then its performance statistics are derived by applying the estimated model to the test data.

A poor result in final test evaluation does not necessarily mean that a trading strategy idea has no value. Often it just indicates that more work is to be done formulating the underlying theory. Also, poor test numbers may be due to invalid cross-validation, for example ignoring time series properties such as overlapping lookback horizons, or inappropriate performance statistics, such as correlation rather than accuracy for strategies whose signals have no proportionate implications.
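
The separation into core training, validation and test samples can be sketched with two chronological cuts. The simulated data, the split proportions and the single model are assumptions made only to keep the example short. In practice several candidate models would be compared on the validation sample before the final step.

```r
library(rsample)

# Simulated data in chronological order (hypothetical).
set.seed(42)
df <- data.frame(t = 1:100, x = rnorm(100))
df$y <- 0.5 * df$x + rnorm(100)

# First cut: broad training sample (first 80%) versus test sample (remainder).
sp.test  <- initial_time_split(df, prop = 0.8)
df.broad <- training(sp.test)
df.test  <- testing(sp.test)

# Second cut: core training sample (first 75% of broad) versus validation sample.
sp.val  <- initial_time_split(df.broad, prop = 0.75)
df.core <- training(sp.val)
df.val  <- testing(sp.val)

c(core = nrow(df.core), validation = nrow(df.val), test = nrow(df.test))

# Final step, stage 1: estimate the chosen model on the full broad training sample.
mod.final <- lm(y ~ x, data = df.broad)
# Final step, stage 2: performance statistic (here a hit ratio) on the test sample.
hr.test <- mean(sign(predict(mod.final, df.test)) == sign(df.test$y))
hr.test
```

`initial_time_split()` preserves the row order, so no validation or test observation precedes the data used for fitting.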

Point-in-time data structure

Point-in-time data are associated with specific dates and carry exactly the values that were available at the respective date. They are like snapshots of information sets that were available in the past, prior to subsequent revisions or changes in conventions. Point-in-time data contrast with the live data provided by most data vendors, which may contain revisions, extensions and recalculations. For the specific purpose of realistically simulating past forecasts, point-in-time data are more informative. This is particularly true if actual economic data are used, which are often subject to large and repeated revisions.

Point-in-time data for macro trading strategies require a specific data structure where the full past available history (i.e. a data frame) is stored at each relevant date. Conveniently, the historic point-in-time data situation can be summarized by a single points-in-time data tibble. Conventional time series data frames record one number at one point in time. The points-in-time tibble records the complete available data history at a specific point in time. The below simple example shows a points-in-time tibble with an extended time series (xts) data frame stored at each month-end.

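library(tidyverse); library(xts); library(rsample)  # packages used by the code in this post
dv.all <- index(xdm.irs["2017-01/", ])  # points in time: all dates from January 2017
# l.xdfs is taken as given: a list with one xts data frame per date in dv.all
tbl.all <- tibble(dates = dv.all, xdm.irs = l.xdfs)  # points-in-time tibble creation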
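
For readers without access to the underlying data, the structure can be reproduced with a self-contained toy example. It uses a simulated monthly series and plain tibbles instead of xts objects, and it ignores revisions, so each stored history is simply the series truncated at the respective date. All names (`tbl.hist`, `dv.pit`, `l.hist`, `tbl.pit`) are hypothetical.

```r
library(tibble)
library(dplyr)
library(purrr)

# Simulated month-end series for 2016 and 2017 (hypothetical data).
set.seed(7)
tbl.hist <- tibble(
  date  = seq(as.Date("2016-02-01"), by = "month", length.out = 24) - 1,
  value = cumsum(rnorm(24))
)

# Points in time: all month-ends from January 2017 onwards.
dv.pit <- tbl.hist$date[tbl.hist$date >= as.Date("2017-01-01")]

# One data frame per point in time, holding only the history up to that date.
l.hist <- map(dv.pit, ~filter(tbl.hist, date <= .x))

# Points-in-time tibble: a date column plus a list column of histories.
tbl.pit <- tibble(dates = dv.pit, hist = l.hist)

tbl.pit
map_int(tbl.pit$hist, nrow)   # the stored history grows by one row per date
```

With genuine vintage data, each element of the list would instead hold the series as it was published at that date, including values that were later revised.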

Training and validation

Validation means that we use out-of-sample forecasting performance (rather than in-sample fit) as basis for choosing a prediction model. This process can be applied similarly to a broader in-sample learning method, where the model form or type are determined endogenously. Hence validation is a basis for point-in-time model choice or broader method choice. In a time series context, cross validation is performed for each point-in-time data set with respect to all models/methods under consideration.

For cross-validation one first creates rolling sequences of training-validation splits of the broad training set. There will be one sequence of splits for each dated data set. The function `rsample::rolling_origin()` creates such splits for data points that are consecutive values. Under the principle of information efficiency, the core training samples should be all historic observations up to the split date. The validation sample should be only a single post-split date. The split date is rolled forward from the end of a minimum initial window up to the penultimate available date.
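
The mechanics of these splits can be seen in a toy data frame with ten consecutive observations. The window length of five is an arbitrary choice for this sketch.

```r
library(rsample)
library(purrr)

df <- data.frame(t = 1:10, y = (1:10)^2)

# cumulative = TRUE (the default) keeps all history up to each split date.
sps <- rolling_origin(df, initial = 5, assess = 1, cumulative = TRUE)

nrow(sps)                                  # 5 splits
map_int(sps$splits, ~nrow(training(.x)))   # 5 6 7 8 9: expanding core training samples
map_int(sps$splits, ~nrow(testing(.x)))    # 1 1 1 1 1: one validation observation each
map_int(sps$splits, ~testing(.x)$t)        # 6 7 8 9 10: validation date rolls forward
```

Setting `cumulative = FALSE` would instead give a fixed-length rolling window, which discards early history.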

In the below simple example, the point-in-time tibble is extended by a list column of split tibbles (tbl.sps). Each split tibble contains a sequence of splits, where the core training part of the split ranges from a minimum up to the largest possible. The later the date, the longer the sequence of splits associated with the date, unless a maximum history has been set.

tbl.als <- tbl.all %>% 
  mutate(
    # create list of tibbles of sequential train-validation splits:
    tbl.sps = map(.x = xdm.irs, ~rolling_origin(.x, initial = 72, assess = 1))
  )
tbl.als
tbl.als$tbl.sps[[1]]  # split tibble

Based on these splits, training and validation samples can be extracted with the rsample::training() and rsample::testing() functions. For each point-in-time data set we deploy a ‘rolling sequence’ of training sets (tbn.trn) and validation sets (tbn.val). The average performance of the models on the validation sets will be the basis for choosing the preferred model for the respective point in time.

Below we show the validation process for a single point-in-time dataset. First, sequences of training and validation sets are extracted as panel tibbles. Then the training of two different prediction models is mapped onto the sequence of training sets. This creates a trained or fitted model for each split based on the respective training data. The split tibble, which is just one cell in the main tibble, is thereby extended to include data sets and models.

tbl.sps <- tbl.als$tbl.sps[[1]] %>%
  mutate(
    # create list column with core training df for each split
    # (stackPanel() is a helper function not shown here):
    tbn.trn = map(splits, ~as_tibble(stackPanel(training(.x), cids = id.irs, cats = cat.irs))),
    # create list column with validation df for each split:
    tbn.val = map(splits, ~as_tibble(stackPanel(testing(.x), cids = id.irs, cats = cat.irs))),
    # candidate model 1, fitted on each core training set:
    mod1 = map(.x = tbn.trn,  ~lm(formula = as.formula("IRS5_XR ~  BBR + RES"), data = na.omit(.x))),
    # candidate model 2 (one additional regressor), fitted on each core training set:
    mod2 = map(.x = tbn.trn,  ~lm(formula = as.formula("IRS5_XR ~  BBR + RES + EMG3MA"), data = na.omit(.x)))
  )
tbl.sps 

The fitted sequences of the candidate models are then used for prediction based on the one-period validation samples. Then for each validation set the performance of the predictions (here one prediction per cross-section of a panel) is measured against the actual values (here returns). In this case performance is simply measured by the cross-sectional hit ratio.

tbl.spp <- tbl.sps %>%
  mutate(
    # calculate all cross-sectional predictions of model 1 for each validation slice:
    v.pred1 = map2(.x = mod1, .y = tbn.val, ~predict(.x, na.omit(.y))),
    # calculate all cross-sectional predictions of model 2 for each validation slice:
    v.pred2 = map2(.x = mod2, .y = tbn.val, ~predict(.x, na.omit(.y))),
    # extract actual cross-sectional values (returns) for each validation slice:
    v.actual = map(tbn.val, ~na.omit(.x)$IRS5_XR),
    # cross-sectional hit ratio of model 1 for each validation slice:
    hr1 = map2_dbl(.x = v.pred1, .y = v.actual, ~mean(sign(.x) == sign(.y))),
    # cross-sectional hit ratio of model 2 for each validation slice:
    hr2 = map2_dbl(.x = v.pred2, .y = v.actual, ~mean(sign(.x) == sign(.y)))
  )
tbl.spp
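
The hit ratio used above is the share of predictions whose sign matches the sign of the realized value. A small worked example with made-up numbers:

```r
v.pred   <- c(0.4, -0.2, 0.1, -0.3, 0.5)    # hypothetical predicted returns
v.actual <- c(0.9,  0.3, 0.2, -0.1, -0.6)   # hypothetical realized returns

hits <- sign(v.pred) == sign(v.actual)      # TRUE FALSE TRUE TRUE FALSE
hr   <- mean(hits)                          # 3 correct signs out of 5
hr                                          # 0.6
```

The statistic disregards the size of predictions and returns, which suits signals that only indicate direction.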

All operations on the point-in-time data sets can be performed on the points-in-time tibble by use of `purrr::map()` functionals, as shown below for a very simple stylized example. Such code is concise but implies nested loops and can take time to run. When data sets are updated, past results should therefore be stored and reloaded rather than recomputed.

map_all <- function(tbl) {  # function that works on any split tibble
  tbl %>%
    mutate(
      tbn.trn = map(splits, ~as_tibble(stackPanel(training(.x), cids = id.irs, cats = cat.irs))),
      tbn.val = map(splits, ~as_tibble(stackPanel(testing(.x), cids = id.irs, cats = cat.irs))),
      mod1 = map(.x = tbn.trn,  ~lm(formula = as.formula("IRS5_XR ~  BBR + RES"), data = na.omit(.x))),
      mod2 = map(.x = tbn.trn,  ~lm(formula = as.formula("IRS5_XR ~  BBR + RES + EMG3MA"), data = na.omit(.x))),
      v.pred1 = map2(.x = mod1, .y = tbn.val, ~predict(.x, na.omit(.y))),
      v.pred2 = map2(.x = mod2, .y = tbn.val, ~predict(.x, na.omit(.y))),
      v.actual = map(tbn.val, ~na.omit(.x)$IRS5_XR),
      hr1 = map2_dbl(.x = v.pred1, .y = v.actual, ~mean(sign(.x) == sign(.y))),
      hr2 = map2_dbl(.x = v.pred2, .y = v.actual, ~mean(sign(.x) == sign(.y)))
    )
}

tbl.alx <- tbl.als %>% # optionally subset the rows of tbl.als first to save run time
  mutate(
    tbl.sps = map(.x = tbl.sps, map_all)  # extends each element of list column (time intensive)
  )
tbl.alx
tbl.alx$tbl.sps[[3]]
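
One simple way to store and reload past results is `saveRDS()` and `readRDS()`, which preserve list columns and model objects. The sketch below is only an illustration: the file location, the stand-in tibble `tbl.demo` and the update logic are hypothetical.

```r
library(tibble)

# Hypothetical cache file; in practice this would be a project path.
f.cache <- file.path(tempdir(), "tbl_demo.rds")

# Stand-in for a points-in-time tibble with a list column of models.
tbl.demo <- tibble(
  dates = as.Date(c("2017-01-31", "2017-02-28")),
  mod   = list(lm(mpg ~ wt, data = mtcars), lm(mpg ~ hp, data = mtcars))
)

saveRDS(tbl.demo, f.cache)      # store results after a time-intensive run
tbl.back <- readRDS(f.cache)    # reload them in a later session

# On update, only dates that are not yet in the cache need to be computed.
dv.new  <- as.Date(c("2017-01-31", "2017-02-28", "2017-03-31"))
dv.todo <- dv.new[!(dv.new %in% tbl.back$dates)]
dv.todo
```

Results for the new dates can then be appended to the reloaded tibble with `dplyr::bind_rows()` before saving again.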

Testing

Testing here means assessing the performance of the broad method and hyperparameters for sequentially building and choosing models. Just as for model training and validation, the splits into method choice/training and testing for each point in time use only the latest period for testing and all the previous periods for training. The differences are that [1] we now have one more time period that had been reserved for testing and [2] only the chosen model is trained based on the full broad training data set.

The model choice is usually made according to the average prediction performance of all validation sets prior to that point in time. In the present example, this is simply an average cross-sectional hit ratio. The average can be taken either [1] over all validation sets of the point in time prior to the latest or [2] the final validation sets of all points in time before the latest. The former is preferable if revised data are viewed as better for model choice and the latter if the unrevised data are seen as more suitable.

tbl.alt <- tbl.alx %>% 
  mutate(
    # create splits for final training/test sets
    split = map(.x = xdm.irs, ~initial_time_split(.x, prop = (nrow(.x) - 1) / nrow(.x))),
    # panel training tibbles:
    tbn.trn = map(.x = split, ~as_tibble(stackPanel(training(.x), cids = id.irs, cats = cat.irs))),
    # panel testing tibbles:
    tbn.tst = map(.x = split, ~as_tibble(stackPanel(testing(.x), cids = id.irs, cats = cat.irs))),
    # average performance of model 1:
    hr1 = map_dbl(.x = tbl.sps, ~mean(.x$hr1, na.rm = TRUE)),
    # average performance of model 2:
    hr2 = map_dbl(.x = tbl.sps, ~mean(.x$hr2, na.rm = TRUE)),
    # hit ratio difference as basis for model choice:
    hrd12 = hr1 - hr2,
    # point-in-time model choice:
    mod = map2(.x = tbn.trn, .y = hrd12,
               # model 1 if its average hit ratio was strictly higher, otherwise model 2:
               ~lm(formula = as.formula(if (isTRUE(.y > 0)) "IRS5_XR ~  BBR + RES" else "IRS5_XR ~  BBR + RES + EMG3MA"),
                   data = na.omit(.x))),
    # actual returns of test sample:
    v.acts = map(.x = tbn.tst, ~.x$IRS5_XR),
    # predicted returns of test sample:
    v.pred = map2(.x = mod, .y = tbn.tst, ~predict(.x, .y)),
    # hit ratio of test:
    hr = map2_dbl(.x = v.pred, .y = v.acts, ~mean(sign(.x) == sign(.y), na.rm = TRUE))
    )

The point-in-time tibble now includes for each date the broad training set (tbn.trn), the test set (tbn.tst), the average hit ratios of the validation sets for models 1 and 2 (hr1, hr2), the difference between these as basis for model choice (hrd12), the chosen model (mod), the vector of actual returns of the test set (v.acts), the vector of predicted returns of the test set (v.pred) and the cross-sectional hit ratio of the test set (hr). The average of the test set hit ratios across time is a fair basis for method evaluation.

tbl.alt
mean(tbl.alt$hr)
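
The model choice rule can also be looked at in isolation. The sketch below uses made-up average hit ratios and placeholder formulas (`y ~ x1`, `y ~ x1 + x2`). Assigning ties to the second model is an arbitrary convention of this example.

```r
library(tibble)
library(dplyr)

tbl.choice <- tibble(
  dates = as.Date(c("2017-01-31", "2017-02-28", "2017-03-31")),
  hr1   = c(0.55, 0.48, 0.60),   # hypothetical average validation hit ratios, model 1
  hr2   = c(0.50, 0.58, 0.60)    # hypothetical average validation hit ratios, model 2
) %>%
  mutate(
    hrd12 = hr1 - hr2,
    # model 1 only if it strictly outperformed model 2 in validation:
    form  = if_else(hrd12 > 0, "y ~ x1", "y ~ x1 + x2")
  )

tbl.choice$form   # "y ~ x1" "y ~ x1 + x2" "y ~ x1 + x2"
```

The explicit comparison `hrd12 > 0` matters: a bare numeric used as a condition in R is treated as TRUE whenever it is nonzero, so a negative difference would also select the first option.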