The R environment makes statistical estimation and learning accessible to portfolio management beyond the traditional quant space. Overcoming technicalities and jargon, managers can operate powerful statistical tools by learning a few lines of code and gaining some basic intuition of statistical models. Thus, for example, R offers convenient functions for time series analysis (characterizing trading signals and returns), seasonal adjustment (detecting inefficiencies and cleaning calendar-dependent data), principal component analysis (condensing the information content of large data sets), standard OLS regression (simplest method to check quantitative relations), panel regression (estimating one type of relation across many countries or companies), logistic regression (estimating the probability of categorical events), Bayesian estimation (characterizing uncertainty of trading strategies comprehensively), and supervised machine learning (delegating the exact form of forecasts and signals to a statistical method).

The post ties in with the SRSV summary on information efficiency.

The examples below have been chosen based on simplicity and personal experience in the development of systematic trading strategies. They imply no ranking over other packages and methods.

Time series analysis

What is it?

Time series data are measured at consistent time intervals. Time series analysis uses statistical methods to characterize their evolution. The most common model for this purpose is ARIMA (autoregressive integrated moving average), which characterizes the evolution of a variable in terms of its dependence on its own past values, its stationarity, and the persistence of disturbances to its path.

How does it support trading?

Most market and economic data are time series. Time series analysis of returns can reveal risk profiles and – on rare occasions – market inefficiencies. Moreover, ARIMA time series analysis of predictor variables provides guidance for adjustments and transformations. For example, non-stationary explanatory variables typically must be transformed into stationary ones. And time series with calendar patterns typically must be seasonally adjusted (see below).

How can it be done in R?

Useful functions for time series analysis are available in the astsa package. For example, the astsa::acf2() function estimates and visualizes the autocorrelation and partial autocorrelation coefficients of a time series. This gives hints on whether a series is autocorrelated (i.e. whether today’s value has an impact on tomorrow’s value) or whether disturbances to the series are persistent and extend over several periods (moving averages).

The astsa::sarima() function estimates an ARIMA model for specific parameters of variable autoregression, order of differencing required to achieve stationarity, and moving average of disturbance impact. The function estimates the parameters of the hypothesized process and gives analytical graphs to double-check residuals (i.e. what is not explained by the model).
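The two steps above can be sketched with base R alone. This is a minimal illustration on simulated data, assuming an AR(1) process with a coefficient of 0.6; stats::acf(), stats::pacf() and stats::arima() are used as stand-ins for the astsa functions, which are only called if that package happens to be installed and the session is interactive (they draw plots).

```r
# Simulate an AR(1) series (illustrative data, not market data)
set.seed(42)
x <- arima.sim(model = list(ar = 0.6), n = 500)

# Autocorrelation and partial autocorrelation coefficients
# (base R counterparts of what astsa::acf2() estimates and plots)
acf_vals  <- acf(x, lag.max = 20, plot = FALSE)
pacf_vals <- pacf(x, lag.max = 20, plot = FALSE)

# Fit an ARIMA(1, 0, 0): one autoregressive lag, no differencing, no MA term
fit <- arima(x, order = c(1, 0, 0))
ar1_hat <- unname(coef(fit)["ar1"])
print(fit)

# Ljung-Box test on the residuals: a high p-value suggests that
# no autocorrelation is left unexplained by the model
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb)

# The functions named in the text, if astsa is available
if (interactive() && requireNamespace("astsa", quietly = TRUE)) {
  astsa::acf2(x, max.lag = 20)
  astsa::sarima(x, p = 1, d = 0, q = 0)
}
```

A partial autocorrelation that is large at lag 1 and small afterwards is the typical signature of an AR(1) process, which is why that order is chosen here.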

Seasonal adjustment

What is it?

Seasonal adjustment is a specific time series technique for estimating and removing seasonal and calendar-related movements from the time series. For this purpose, time series are decomposed into four components: the seasonal component, the trend component, the cyclical component, and the error component.

How does it support trading?

Seasonality is defined as movements observed in time series that repeat throughout the year, at given time points each year, with similar intensity in the same season. Seasonal fluctuations take the form of regularly spaced peaks and troughs which have a consistent direction and a similar magnitude every year. These patterns can be induced by two main groups of factors: [1] atmospheric conditions between seasons and [2] social habits and practices originating from holidays. Seasonal movements are expected to be predictable.

Seasonal adjustment has a direct and indirect application to macro trading. The direct application is the detection of seasonal patterns in contract prices or returns. The – more common – indirect application is the detection and removal of seasonal patterns in trading factors. Whenever a trading factor displays seasonal influences but the returns do not, the former needs to be seasonally adjusted.

How can it be done in R?

The seasonal package provides an easy-to-use and full-featured R interface to X-13ARIMA-SEATS, which is the seasonal adjustment software developed by the United States Census Bureau. Its automated procedures make it possible to quickly produce good seasonal adjustments of time series.

The core function of the package is seas(). By default, seas() calls the automatic procedures of X-13ARIMA-SEATS to perform a seasonal adjustment that works well in most circumstances. The first argument of seas() has to be a time series of class “ts”. The function returns an object of class “seas” that contains all necessary information on the adjustment.
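As a rough sketch of the workflow, the example below adjusts the AirPassengers series that ships with R. The decomposition uses stats::stl() as a simple stand-in, which is not the X-13ARIMA-SEATS procedure; the seas() call from the text is attempted only if the seasonal package is installed, and is wrapped in try() on the assumption that its X-13 binary may be missing.

```r
# AirPassengers is a monthly series of class "ts" included in base R
y <- log(AirPassengers)  # logs make the seasonal swings roughly additive

# Stand-in decomposition into seasonal, trend and remainder components
dec <- stl(y, s.window = "periodic")
seasonal_part <- dec$time.series[, "seasonal"]

# Seasonally adjusted series (in logs): original minus seasonal component
y_sa <- y - seasonal_part

# The function named in the text, if the seasonal package is available
if (requireNamespace("seasonal", quietly = TRUE)) {
  m <- try(seasonal::seas(AirPassengers), silent = TRUE)
  if (!inherits(m, "try-error")) {
    sa_x13 <- seasonal::final(m)  # extract the adjusted series
  }
}
```

The same pattern applies to a trading factor: adjust the factor first, then relate the adjusted series to returns.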

Principal component analysis

What is it?

Principal component analysis (PCA) is a popular dimension reduction technique that does not use or require any theoretical structure. A dimension is a type of value associated with an observation. For example, the various real economic data series available for a country can be regarded as the dimensions of the real economic data set. PCA simply condenses high-dimensional data into a low-dimensional sub-space in a way that minimizes the residual sum of squares of the projection. That means that PCA minimizes the sum of squared distances between projections and original points.

How does it support trading?

Even the most attentive and analytical portfolio manager cannot possibly follow all relevant statistical information. Dimension reduction condenses the information content of a multitude of data series into a small manageable set of factors. This reduction is important for forecasting because many data series have only limited and highly correlated information content. Dimension reduction techniques that use theoretical structure and logic are preferable. But where this is not possible, PCA provides a quick and easy way to cut down large data sets.

How can it be done in R?

A convenient way to perform principal component analysis in R is the prcomp() function of the standard stats package. It returns results as a list object of class prcomp. Given a data frame of the variables whose principal components are to be estimated, the generic application of the function is prcomp(dataframe, center = TRUE, scale. = FALSE,…), where center is a logical value, with TRUE shifting the original series to zero mean, and scale. is a logical value, with TRUE scaling the original series to unit variance. Since the relative magnitude of the variables influences the resulting principal components, one should typically normalize variables that do not naturally have the same or comparable scales.

The plot method returns a plot of the variances associated with the principal components. The summary method describes the importance of the individual principal components.

The generic function predict(object, newdata,…) can be applied to an object of class prcomp and “new observations” specified by the argument newdata, such that the principal components for the new data are calculated based on the estimation results stored in the prcomp object. Data visualization is one of the key motivations of PCA, since the human eye can only apprehend up to three dimensions. The factoextra package makes it easy to extract and visualize the output of exploratory multivariate data analyses, including PCA. For example, its fviz_pca_var() function displays the original variables in terms of the top two dimensions of the PCA.
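A minimal sketch of the prcomp() workflow follows. The data are simulated, and the column names (growth, orders, sales, noise) are hypothetical: three series are driven by one common factor on very different scales, which is why scale. = TRUE is chosen here.

```r
set.seed(1)
n <- 200
common <- rnorm(n)  # one hypothetical common factor

# Three series driven by the common factor plus one unrelated series
df <- data.frame(
  growth = common + rnorm(n, sd = 0.3),
  orders = 2 * common + rnorm(n, sd = 0.5),
  sales  = 10 * common + rnorm(n, sd = 3),
  noise  = rnorm(n)
)
train   <- df[1:150, ]    # estimation sample
new_obs <- df[151:200, ]  # "new observations"

# Centre and scale, since the series have very different magnitudes
pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Importance of the individual components (standard deviation,
# proportion of variance, cumulative proportion)
imp <- summary(pca)$importance
print(imp)

# Principal components of the new data, using the loadings, centring
# and scaling estimated on the training sample
pc_new <- predict(pca, newdata = new_obs)
```

Calling plot(pca) on the fitted object draws the variances of the components, which helps to decide how many components to keep.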

R offers packages for more advanced dimension reduction techniques, such as the selection of a set of background factors through dynamic factor analysis (MARSS package) or the selection of a subset of best explanatory variables with the Least Absolute Shrinkage and Selection Operator (LASSO) or “elastic net” (glmnet package or elasticnet package).

Basic OLS regression

What is it?

Linear regression is a mathematical operation on sample data that assumes that the underlying population is characterized by a dependent variable that is linearly related to a set of explanatory variables and a stochastic disturbance. Linear regression further assumes a set of other regularity conditions that depend on the calculation method used. If the assumptions are correct, regression allows clearly specified inference on the coefficients and residuals of the underlying linear population model.

Ordinary least squares (OLS) is the most popular version of linear regression. It computes the values of linear coefficients that minimize the squared distance between predicted and actual dependent variables in a sample. An OLS regression coefficient is really just an enhanced version of a covariance-variance ratio. Its numerator is a “net” covariance between the dependent and one explanatory variable accounting for the influence of the other explanatory variables. The denominator is the variance of the explanatory variable.

How does it support trading?

Most trading strategies, whether quantitative or not, rely on the relation between a predictor variable and a predicted variable. Trades are often based on the belief that if x happens then y will follow. Thus a trader may believe that if a currency depreciates, the local inflation rate may increase with consequences for local asset prices. Under many circumstances linear regression is simply the easiest way [1] to verify if this belief is significantly supported by the data and [2] to estimate what the magnitude of the response will be.

How can it be done in R?

The standard function to estimate linear models in R is the lm() function of the stats package. Its typical usage is lm(form, df,…), where form is an object of class formula that describes the model to be fitted and df denotes the dataframe from which the variables are taken. The function returns an object of class lm, which is a comprehensive set of data and information.

The plot method can be applied to an lm object in order to obtain visual checks for heteroskedasticity, non-linear relations, non-normality and undue influence of outliers. Put simply, this output informs on whether the assumptions on which the regression results are based are appropriate and whether a few extreme observations may have biased the findings.
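The lm() workflow can be sketched as follows. The data are simulated, and the variable names (fx_depreciation, inflation_change) are hypothetical labels that echo the currency example above, not actual data.

```r
set.seed(7)
n <- 120

# Hypothetical predictor and response with a true slope of 0.8
df <- data.frame(fx_depreciation = rnorm(n))
df$inflation_change <- 0.8 * df$fx_depreciation + rnorm(n, sd = 1)

# Estimate the linear model: formula first, then the data frame
fit <- lm(inflation_change ~ fx_depreciation, data = df)

# Coefficient table: estimate, standard error, t-value, p-value
coefs <- summary(fit)$coefficients
print(coefs)

# With a single predictor the OLS slope is the covariance-variance ratio
slope_ratio <- cov(df$fx_depreciation, df$inflation_change) /
  var(df$fx_depreciation)

# Diagnostic plots (residuals, Q-Q, scale-location, leverage)
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(fit)
}
```

The p-value in the coefficient table addresses point [1], whether the belief is supported by the data, and the slope estimate addresses point [2], the magnitude of the response.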

Panel regression

What is it?

Panel regression is simply regression applied to a data panel. A panel contains multiple cross sections of one type of time series data, such as multiple countries’ equity price data. A panel model typically describes the evolution of panels in the form of a linear equation system. Hence rather than specifying a single relation between a dependent variable and its predictor variables, the panel model looks at multiple instances of the one type of linear relation. This could be, for example, the evolution of equity prices in dependence upon changes in interest rates for a range of countries.

How does it support trading?

Panel regression is particularly useful for assessing the impact of macroeconomic and company-specific trend factors, which typically have low (monthly or quarterly) frequency but are available for many countries or companies. Thus, while individual time series may be too short to validate the impact of trend factors, panels of relative or idiosyncratic trends often produce meaningful statistical results.

How can it be done in R?

Estimation and testing of linear panel models are made really easy by using the plm() and pvcm() functions of the plm package. The real trick is to know which type of panel model to estimate, which requires just a little bit of intuition.

  • The standard linear pooling model posits parameter homogeneity, i.e. it assumes that intercepts (constants) and slope coefficients (sensitivities of predicted variable to predictor) of the linear system are equal for all cross sections.
  • The fixed effects model (or “within” model) assumes that an individual cross-section’s disturbance is correlated with its predictors. In this case, OLS estimation of the slope coefficients would be inconsistent, i.e. the estimator would not converge to the true value as the sample increases. As a result, it is customary to treat individual cross-section disturbances as a further set of parameters to be estimated, just as if they were cross-section-specific intercepts. The model is also called “within” model because it implies that the estimated relation between predicted and predictor variables only considers variations within a cross-section and not the variation across sections.
  • The random effects model assumes that the individual disturbance component of a cross-section is uncorrelated with the predictors. In this case, random effects estimation produces lower standard errors around the coefficient estimates than fixed effects estimation. This means it is more efficient. Random effects are estimated with partial pooling, while fixed effects are not. Partial pooling means that, if you have few data points in a group, the group’s effect estimate will be based partially on the more abundant data from other groups.
  • The “between” model is computed based on time averages of the data. It discards all the information due to intertemporal variability. The “between” model is consistent in some settings where there is non-stationarity and is often preferred to estimate long-run relationships. It is the opposite of the “within” model.
  • The random coefficients model relaxes the assumption that slope coefficients are the same across sections, i.e. that the relation between the predictor and predicted variables is equal across sections. It assumes that cross-sectional slopes vary randomly around a common average. The cross-section specific components have a mean of zero. This type of model is a convenient tool to assess if there are differences in predictive power across sections.
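The difference between the pooling, “within” and “between” models can be illustrated with base R on a simulated panel. This is a sketch under stated assumptions: the section-specific intercepts are deliberately correlated with the predictor, the true slope is 0.5, and lm() with section dummies is used as a stand-in for the within estimator. The plm() call is only run if the plm package is installed.

```r
set.seed(11)
n_id <- 20  # cross sections (e.g. countries)
n_t  <- 40  # periods per cross section
id <- rep(seq_len(n_id), each = n_t)

alpha <- rnorm(n_id, sd = 2)                   # section-specific intercepts
x <- 0.8 * alpha[id] + rnorm(n_id * n_t)       # predictor correlated with them
y <- alpha[id] + 0.5 * x + rnorm(n_id * n_t)   # true slope is 0.5
panel <- data.frame(id = factor(id),
                    time = rep(seq_len(n_t), times = n_id),
                    x = x, y = y)

# Pooling model: one intercept and one slope for all sections
pooled <- lm(y ~ x, data = panel)

# Within (fixed effects) model: section dummies absorb the intercepts
within <- lm(y ~ x + id, data = panel)

# Between model: regression on the time averages of each section
means   <- aggregate(cbind(y, x) ~ id, data = panel, FUN = mean)
between <- lm(y ~ x, data = means)

slopes <- c(pooling = coef(pooled)[["x"]],
            within  = coef(within)[["x"]],
            between = coef(between)[["x"]])
print(round(slopes, 3))

# The same within model with the function named in the text
if (requireNamespace("plm", quietly = TRUE)) {
  fe <- plm::plm(y ~ x, data = panel, index = c("id", "time"),
                 model = "within")
  print(coef(fe))
}
```

Because the intercepts are correlated with the predictor in this simulation, the pooling slope is biased while the within slope stays close to the true value, which is the case the fixed effects model is designed for.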

Logit regression

What is it?

Logistic regression is a classification algorithm used to predict a binary outcome (0 or 1) based on a probability that is related to a set of independent variables. It operates like a special case of linear regression with the log of odds (i.e. relative probabilities) as dependent variable. Logistic regression is part of a larger class of algorithms known as generalized linear models (GLMs).

How does it support trading?

Logit regression is the standard method to assess whether a specific market event is likely to happen, based on available data. Predictable events could be the break of fixed exchange rate regimes or debt default in emerging economies.

How can it be done in R?

In R one can estimate a logistic regression with the glm(form, family, data) function by choosing the binomial family with the link ‘logit’.
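A minimal sketch of such an estimation is shown below. The data are simulated, and the names event and reserves_gap are hypothetical placeholders for a binary market event and a predictor.

```r
set.seed(3)
n <- 500

# Hypothetical predictor and a binary event whose log-odds are linear in it
reserves_gap <- rnorm(n)
p_true <- plogis(-1 + 1.5 * reserves_gap)
event  <- rbinom(n, size = 1, prob = p_true)
df <- data.frame(event, reserves_gap)

# Logistic regression: binomial family with logit link
fit <- glm(event ~ reserves_gap, family = binomial(link = "logit"), data = df)
print(coef(summary(fit)))

# Predicted event probabilities for three values of the predictor
scenarios <- data.frame(reserves_gap = c(-1, 0, 2))
p_hat <- predict(fit, newdata = scenarios, type = "response")
print(p_hat)
```

Setting type = "response" returns probabilities; the default of predict() for a glm object returns values on the log-odds scale.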

Bayesian estimation

What is it?

Bayesian analysis is a statistical paradigm used to characterize unknown parameters by use of probability statements. In Bayesian statistics, probabilities are in the mind, not the world. Probability is a description of how certain we are that some statement is true. When we get new information, we update our probabilities to take the new information into account. Bayesian inference can be condensed into four steps:

  • A Bayesian analysis starts by choosing values for the prior probabilities of various hypotheses or a prior distribution in continuous form. Prior probabilities describe our initial uncertainty, before taking the data into account. They apply judgment based on information outside the research at hand.
  • Data are taken into account in the form of likelihoods. In general, likelihood is the probability of observing the data at hand, assuming that a particular hypothesis is true. Simplistically we go through each hypothesis in turn and ask “What is the probability of observing the actual data under this hypothesis?”.
  • The unnormalized posterior is calculated as the product of prior probabilities and likelihoods. These values are also called “prior times likelihood”. They do not sum to 1 but are proportional to the actual posterior probabilities.
  • To obtain the posterior probabilities, one must divide the unnormalized posterior by its sum, producing numbers that do sum to 1. This posterior distribution is the basis of Bayesian inference. It is a probabilistic assessment of the state of the world in light of prior uncertainty and data.
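The steps above can be traced in a few lines of base R. The example is purely illustrative: it assumes three hypotheses about the probability that a strategy has a winning month, hypothetical prior weights, and a hypothetical record of 14 winning months out of 20.

```r
# Step 1: hypotheses and prior probabilities (illustrative values)
hyp   <- c(0.4, 0.5, 0.6)     # probability of a winning month
prior <- c(0.25, 0.50, 0.25)  # initial judgment, sums to 1

# Hypothetical data: 14 winning months out of 20
wins <- 14
n    <- 20

# Step 2: likelihood of the data under each hypothesis
lik <- dbinom(wins, size = n, prob = hyp)

# Step 3: unnormalized posterior ("prior times likelihood")
unnorm <- prior * lik

# Step 4: divide by the sum to obtain posterior probabilities
post <- unnorm / sum(unnorm)

print(round(data.frame(hyp, prior, lik, unnorm, post), 4))
```

The data shift probability towards the hypothesis that explains them best, while the prior still limits how far the posterior moves.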

How does it support trading?

A key application of Bayesian analysis is the assessment of systematic or rules-based trading strategies. Rather than focusing on the backtest of a specific, optimized trading strategy, Bayesian analysis can use multiple plausible but non-optimized versions of the trading idea and prior expert judgment to arrive at probability distributions for returns, volatility, and maximum drawdowns.

How can it be done in R?

The BEST package formally provides a Bayesian alternative to a t-test. This means it allows assessing the probability that a population parameter is above a specific value or different from another population’s parameter. Practically, this gives us tools to assess the probability that a trading strategy is profitable in the short or long run or the probability that it is an improvement over another strategy.

Compared to the p-value of a classic t-test, the analytics of the BEST package provide more information, because they consider the full distribution. The core function of the package is BESTmcmc(). It takes the data and a specification of priors and returns Markov chain Monte Carlo (MCMC) samples for the parameters.

For example, this can be used to assess the periodic average returns of a trading strategy, by passing multiple instances of returns on that type of strategy (over different time periods and model versions) alongside some prior judgment on the distribution of such returns to the BESTmcmc() function. In general, the impact of the priors should be the stronger [1] the more informative they are, i.e. the tighter the distribution, [2] the fewer data points are passed, and [3] the worse the fit of the data to the distribution. The function returns a data frame of class BEST that informs on the posterior distribution of mean and variability of returns. The default plot of the BEST object is an annotated histogram of the posterior distribution of the return mean. By default it quantifies the mean, the lower and upper bounds of the 95% highest density interval, and the probability that the mean is below and above zero.
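Points [1] and [2] can be illustrated without MCMC. The sketch below is not the model that BEST estimates: it assumes normally distributed returns with a known standard deviation and a normal prior on the mean, for which the posterior has a closed form. The function name posterior_mean_return and the return data are hypothetical.

```r
# Closed-form posterior for the mean of normal data with known sigma
# and a normal prior on the mean (a simplification, not the BEST model)
posterior_mean_return <- function(returns, prior_mean, prior_sd,
                                  sigma = sd(returns)) {
  n <- length(returns)
  prior_prec <- 1 / prior_sd^2  # precision of the prior
  data_prec  <- n / sigma^2     # precision of the sample mean
  post_var   <- 1 / (prior_prec + data_prec)
  post_mean  <- post_var * (prior_prec * prior_mean +
                              data_prec * mean(returns))
  list(mean = post_mean,
       sd = sqrt(post_var),
       prob_positive = 1 - pnorm(0, mean = post_mean, sd = sqrt(post_var)))
}

set.seed(5)
returns <- rnorm(24, mean = 0.4, sd = 2)  # hypothetical monthly returns in %

# A tight prior around zero dominates 24 data points ...
tight <- posterior_mean_return(returns, prior_mean = 0, prior_sd = 0.1)
# ... while a loose prior lets the data speak
loose <- posterior_mean_return(returns, prior_mean = 0, prior_sd = 5)

print(unlist(tight))
print(unlist(loose))
```

The posterior mean is a precision-weighted average of prior mean and sample mean, so a tighter prior or a shorter return history both pull the estimate towards the prior.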

Learning

What is it?

Here learning refers to supervised machine learning. These are methods that find functional forms of prediction models in a manner that optimizes out-of-sample forecasting. Learning mainly serves prediction, whereas classical econometrics mainly estimates specific structural parameters. Also, supervised machine learning emphasizes past patterns in data rather than top-down theoretical priors. The prediction function is typically found in two stages: [1] picking the “best” form conditional on a given level of complexity and [2] picking the “best” complexity based on past out-of-sample forecast performance.
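The two stages can be sketched with base R. In this illustrative setup, complexity is assumed to be the degree of a polynomial, stage [1] is an OLS fit for a given degree, and stage [2] compares the fits on a later hold-out sample. The data are simulated from a non-linear relation.

```r
set.seed(9)
n <- 300
x <- runif(n, min = -2, max = 2)
y <- sin(1.5 * x) + rnorm(n, sd = 0.3)  # non-linear relation plus noise
df <- data.frame(x, y)

train <- df[1:200, ]    # earlier observations for fitting
valid <- df[201:300, ]  # later observations for out-of-sample checks

degrees <- 1:8
val_mse <- sapply(degrees, function(d) {
  # Stage 1: best-fitting function for a given complexity (degree d)
  fit <- lm(y ~ poly(x, d), data = train)
  # Out-of-sample mean squared error of that function
  mean((valid$y - predict(fit, newdata = valid))^2)
})

# Stage 2: pick the complexity with the best out-of-sample performance
best_degree <- degrees[which.min(val_mse)]
print(round(val_mse, 4))
print(best_degree)
```

Packages such as caret automate this kind of loop with resampling schemes rather than a single split.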

How does it support trading?

Supervised machine learning has two key applications in investment management: forecasting and backtesting. It is particularly valuable for forecasting in fields where the amount of data is vast and the insights into their structure or other theoretical priors are sparse. Supervised machine learning helps backtesting when decisions about the exact form of a trading strategy must be made without much theoretical guidance. In this case, the backtest is more about the method of choosing a predictor function than about the actual current predictor function.

How can it be done in R?

A very useful set of tools for machine learning in R is provided by the caret package (short for Classification And REgression Training). The package offers a set of functions to streamline the process for creating predictive models, including for data splitting, pre-processing, feature selection, model tuning using resampling and variable importance estimation. The key functions of this package are well explained in the DataCamp course on the “machine learning toolbox”.

Another useful set of functions is provided by the glmnet package, which fits lasso and elastic-net model paths for regression, logistic and multinomial regression using coordinate descent. The algorithm is considered to be extremely fast. Put simply, glmnet is an extension of the generalized linear regression model (GLM) that places constraints on the magnitude of the coefficients to prevent overfitting and – in some cases – select variables. This is more commonly known as “penalized” regression modeling and is a very useful technique on datasets with many predictors and few observations.
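The effect of a penalty on coefficient magnitude can be sketched in base R with ridge regression, which has a closed-form solution. This is a stand-in for illustration only: glmnet itself uses coordinate descent, and its lasso penalty can also set coefficients exactly to zero, which ridge does not. The helper name ridge and the simulated data are assumptions; the cv.glmnet() call runs only if glmnet is installed.

```r
set.seed(21)
n <- 40  # few observations
p <- 30  # many predictors

X <- scale(matrix(rnorm(n * p), nrow = n, ncol = p))  # standardized predictors
beta_true <- c(2, -1.5, rep(0, p - 2))  # only two predictors matter
y <- as.vector(X %*% beta_true + rnorm(n))
y <- y - mean(y)  # centred response, so no intercept is needed

# Ridge estimator: (X'X + lambda * I)^(-1) X'y
ridge <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols   <- ridge(X, y, lambda = 0)   # no penalty: plain least squares
b_ridge <- ridge(X, y, lambda = 10)  # penalty shrinks the coefficients

print(c(ols_norm = sum(b_ols^2), ridge_norm = sum(b_ridge^2)))

# Lasso with cross-validated penalty, if glmnet is available
if (requireNamespace("glmnet", quietly = TRUE)) {
  cv <- glmnet::cv.glmnet(X, y, alpha = 1)
  print(coef(cv, s = "lambda.min"))
}
```

With 30 predictors and 40 observations the unpenalized fit is noisy, and the penalty trades a little bias for a large reduction in coefficient variance.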