This notebook uses tidyverse and other related packages and simulated example data for (equity index) returns and local return factors.
cv.packs <- c("tidyverse", "lubridate", "xts", "rsample", "recipes", "parsnip", "yardstick", "workflows", "tune", "dials")
for (pack in cv.packs) require(pack, character.only = TRUE)
set.seed(110)
dv.meds <- seq(as.Date("2010-02-01"), length = 126, by = "1 month") - 1 # date vector of month-end dates
cv.cas <- c("EUR", "USD", "JPY", "CAD", "GBP", "AUD") # character vector of currency area names
cv.fac <- str_c("FACT", 1:9) # character vector of factor names
cv.alf <- outer(cv.cas, cv.fac, str_c, sep = "_") %>% as.vector() %>% sort() # character vector of all factor series names
nm <- length(dv.meds)
nc <- length(cv.cas)
na <- length(cv.alf)
tbl.fac <- matrix(rnorm(nm * na, 0, 1), nm, na) %>%
as_tibble %>% set_names(cv.alf)
tbl.eqr <- matrix(NA, nm, nc) %>% as_tibble %>% set_names(str_c(cv.cas, "_EQ_XR"))
for (ca in cv.cas) { # simulate equity returns related to local factors
tbl.eqr[str_c(ca, "_EQ_XR")] <- as.matrix(tbl.fac[str_c(ca, cv.fac, sep = "_")]) %*% rnorm(9, 0, 1) + rnorm(nm, 0.5, 5)
}
tbm <- bind_cols(as_tibble(dv.meds) %>% set_names("date"), tbl.eqr, tbl.fac) # simulated tibble data set
tbm %>% print()
tidyr package
The purpose of the tidyr package is to make a dataset tidy, organizing it in a standardized form. Standardization facilitates subsequent operations, estimation and analysis, particularly for the other packages of the "tidyverse", such as dplyr or ggplot2. A tidy data set is a data table that meets the following conditions:
- every column is a variable,
- every row is an observation, and
- every cell holds a single value.
Pivoting makes tables longer or wider, by combining or splitting the contents of columns. In the context of macroeconomic and financial analysis this serves common tasks, such as converting wide data formats (that are akin to spreadsheets and used in xts/zoo objects) to long data formats (as prevalent in SQL) and vice versa.
tbm.lon <- tbm %>% tidyr::pivot_longer(-date, names_to = "rtype", values_to = "rval") # create "long" data frame
print(tbm.lon)
tbm.wid <- tbm.lon %>% tidyr::pivot_wider(names_from = "rtype", values_from = "rval") # long format to wide
xdm.wid <- xts(tbm.wid %>% select(-date), tbm.wid$date) # convert to xts
str(xdm.wid) # str() prints the structure itself, so no print() is needed
Nesting creates a list-column of data frames. Nesting is implicitly a summarising operation: it produces one row for each group defined by the non-nested columns. This is useful in conjunction with other summaries that work with whole datasets, most notably models.
The function tidyr::nest(.data, ...) takes a dataframe (.data) and a tidy-selected group of columns to nest (...), in the form new_col = c(col1, col2,...). The right hand side can be any tidy select expression.
tbm.eqr <- tbm %>%
mutate(year = year(date)) %>%
nest(data = -year) %>% # data list column that contains one data tibble per year
tail(5) %>% print()
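The link between nesting and models can be illustrated with a minimal sketch. It assumes a small made-up data set; the names toy, grp, fit and slope are purely illustrative and not part of this notebook's data.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
# hypothetical toy data: two groups with a linear relation between x and y
toy <- tibble(
  grp = rep(c("a", "b"), each = 20),
  x = rnorm(40),
  y = 2 * x + rnorm(40)
)

nested <- toy %>%
  nest(data = c(x, y)) %>%                      # one row per group, data kept in a list-column
  mutate(
    fit = map(data, ~ lm(y ~ x, data = .x)),    # one regression per nested tibble
    slope = map_dbl(fit, ~ coef(.x)[["x"]])     # extract the slope coefficient of each model
  )
print(nested)
```

Each row of the result holds the data, the fitted model and one summary number for its group, which is the pattern that the model-related packages further below build on.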
dplyr package
The dplyr package is one of the key workhorse packages in R. Its purpose is to transform data tables efficiently. Transformations include [1] grouping and summarizing cases (rows), [2] extracting, arranging and adding rows, [3] extracting, changing and adding new variables (columns), and [4] combining tables through "mutating joins" or simple row bindings.
The basic object used by the dplyr package is the tidy tibble. Financial and economic time series data frames need to be converted to this class. Importantly, tidy data do not use row names. Information stored in row names, such as dates in zoo and xts objects, should be added as an explicit variable in the tibble.
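As a minimal sketch of such a conversion, the following moves the time index of a small xts object into an explicit date column. The object x and its series names A and B are made up for illustration.

```r
library(xts)
library(tibble)
library(dplyr)

# hypothetical xts object with three daily observations of two series
x <- xts(
  matrix(c(1, 2, 3, 4, 5, 6), ncol = 2, dimnames = list(NULL, c("A", "B"))),
  order.by = as.Date("2020-01-01") + 0:2
)

tbl <- bind_cols(
  tibble(date = zoo::index(x)),   # the time index becomes an explicit column
  as_tibble(zoo::coredata(x))     # the values without the time index
)
print(tbl)
```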
The grouping and summarising functions of dplyr allow swift construction of tables of grouped summary statistics, such as means, standard deviations, benchmark correlations and so forth. Such tables combine observed cases by a grouping criterion (such as a time period, a particular market state or the value of an economic trend) and display the summary statistics of a chosen group of variables (for example asset class returns).
The benefits of dplyr for this purpose are [1] simple human-readable code, [2] a wide range of grouping criteria based on any of the columns in the tibble, and [3] easy selection of variables (columns) by use of dplyr::select() or dplyr::across() in conjunction with helper functions. Data selection can be based on unquoted data variables, character vectors or regular expressions. Moreover, the dplyr::arrange() function makes it easy to sort the rows of the summary table by any of its columns.
tbm %>% group_by(year = year(date)) %>% # actual grouping function, with a named grouping column
  select(year, ends_with("EQ_XR")) %>%
  summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE))) %>% # median of all numeric columns is calculated by year group
  slice_tail(n = 3) %>% print()
Summaries can also be two-dimensional, i.e. reach over column and row groups. The function dplyr::rowwise() allows computing on a data frame one row at a time. One can use it to display summaries or perform mutations “by row”. This makes it particularly useful in dataframes with list-columns. Like dplyr::group_by(), the function dplyr::rowwise() does not change what the data looks like; it changes how dplyr verbs operate on the data.
tbm %>% mutate(year = year(date)) %>%
  rowwise() %>% # subsequent verbs operate one row at a time
  mutate(USD_BIGFACT = mean(c_across(USD_FACT1:USD_FACT9))) %>% # mean of the nine USD factors within each row
  select(year | date | USD_BIGFACT) %>%
  group_by(year) %>% # actual grouping function (replaces the rowwise grouping)
  summarise(USD_BIGFACT = mean(USD_BIGFACT, na.rm = TRUE), .groups = "drop") %>%
  slice_tail(n = 5) %>% print()
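Within a rowwise data frame a range of columns is collected with dplyr::c_across(). A minimal sketch on made-up numbers (the tibble toy and its columns f1 to f3 are illustrative only) makes the row-by-row calculation easy to verify by hand:

```r
library(dplyr)

# hypothetical toy data with three factor columns
toy <- tibble(f1 = c(1, 2, 3), f2 = c(10, 20, 30), f3 = c(100, 200, 300))

row_means <- toy %>%
  rowwise() %>%                          # operate one row at a time
  mutate(m = mean(c_across(f1:f3))) %>%  # c_across() collects the selected columns of the current row
  ungroup()                              # drop the rowwise grouping again
print(row_means)
```

The first row gives (1 + 10 + 100) / 3 = 37, the second 74 and the third 111.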
The dplyr functions allow transforming groups or panels of variables in one single operation with transparent code.
Columnwise operations with the dplyr::across(.cols, .fns) function perform one and the same operation on multiple columns of a table. These operations are particularly important inside summarise() and mutate() and now supersede scoped functions (functions that end in _if, _at or _all). Columnwise operations allow summarising or mutating data panels in one go. Since one can apply a whole list of functions to a panel at once, this is a very powerful method for performing many operations with a few short lines of code.
- .cols selects the columns to operate on. It uses tidy selection (like select()) to pick variables by position, name, and type. One can also use a character vector, for example in conjunction with dplyr::all_of().
- .fns is a function or list of functions to apply to each column. The argument also accepts a purrr style formula (or list of formulas) like ~ .x / 2.
Reference: https://dplyr.tidyverse.org/articles/colwise.html
tbm %>% summarise(across(USD_FACT1:USD_FACT9, ~mean(.x, na.rm = TRUE))) %>% print() # apply function across columns
l.fs <- list(sign = ~sign(.x), abs = ~abs(.x)) # list of functions
tbm.mod <- tbm %>%
  transmute(across(c(EUR_EQ_XR, USD_EQ_XR), l.fs)) # apply list of functions over multiple columns
print(slice_tail(tbm.mod, n = 5))
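Column selection through a character vector and control over the output names can be sketched as follows. The tibble toy and the vector cols are made up for illustration; the .names argument of across() sets the pattern for the new column names.

```r
library(dplyr)

# hypothetical toy data: two numeric columns and one character column
toy <- tibble(x = c(1, -2, 3), y = c(-4, 5, -6), z = c("a", "b", "c"))
cols <- c("x", "y")  # column names held in a character vector

out <- toy %>%
  summarise(across(all_of(cols), list(mean = mean, max = max), .names = "{.col}_{.fn}"))
print(out)
```

The result has one column per combination of selected variable and function, here x_mean, x_max, y_mean and y_max.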
purrr package
The tidyverse package purrr enhances R’s functional programming (FP) toolkit. It offers a complete and consistent set of tools for applying functions to data structures, working with lists in general, reducing lists, and modifying function behaviour.
The family of map() functions makes it possible to replace many standard loops with code that is more succinct and easier to read. It generalizes the application of all sorts of operations to sets of variables in a dataframe. The map function family is similar to the apply function family in base R, but offers more variation and is more consistent in the format of output values.
In general, purrr::map() returns a list. However, the type of object that is returned can be determined by choosing a map function with the appropriate suffix, such as map_dbl() for a numeric vector, map_chr() for a character vector or map_df() for a data frame.
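The effect of the suffix can be seen in a minimal sketch on a made-up list (the names l, r_list, r_dbl and r_chr are illustrative only):

```r
library(purrr)

l <- list(a = 1:3, b = 4:6)  # hypothetical input list

r_list <- map(l, mean)                            # plain map(): always a list
r_dbl <- map_dbl(l, mean)                         # suffix _dbl: named numeric vector
r_chr <- map_chr(l, ~ paste(.x, collapse = "-"))  # suffix _chr: named character vector

str(r_list)
print(r_dbl)
print(r_chr)
```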
The simplest function purrr::map(.x, .f, ...) transforms its input by applying a function (.f) to each element of a list, dataframe or atomic vector (.x) and returning an object of the same length as the input. Put simply, the function `.f` is applied to each element of the list `.x` in turn. The map functions also use the ... (“dot dot dot”) argument to pass along additional arguments to .f at each iterative call. Multiple arguments can be passed along using commas to separate them.
Since dataframes are lists of vectors, the purrr::map() function is suitable for operating on dataframe columns. Thus, the function can be applied to selected columns.
tbm %>% select(starts_with(c("EUR", "USD")) & ends_with("FACT1")) %>% map(summary)
The standard purrr::map(.x = list, .f = function) syntax often works. However, for cases where one must specify how the list element is used, a more explicit version is required. An explicit syntax can be based on a one-sided formula, called a mapper, and takes the form purrr::map(list, ~function(.x)). It uses the argument .x, ., or ..1 to denote where the list element goes inside the function.
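The two ways of calling a function can be compared in a short sketch. The list l is made up for illustration; the last line shows a case where the explicit syntax is needed because the list element enters the expression twice.

```r
library(purrr)

l <- list(u = c(1, NA, 3), v = c(4, 5, NA))  # hypothetical list with missing values

m1 <- map_dbl(l, mean, na.rm = TRUE)             # extra argument passed along through ...
m2 <- map_dbl(l, ~ mean(.x, na.rm = TRUE))       # same calculation with explicit mapper syntax
m3 <- map_dbl(l, ~ sum(is.na(.x)) / length(.x))  # share of missing values, uses .x twice

print(m1)
print(m3)
```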
Mappers are short-hand one-sided formulas that are often used like lambda functions. However, mappers can also be named and reused, very much like regular functions, by using the purrr::as_mapper() function. Applied in an assignment statement, this function gives reusable mappers.
returnLast12Months <- as_mapper(~sum(tail(.x, 12))) # reusable mapper
tbm %>% select(ends_with("EQ_XR")) %>% map_dbl(returnLast12Months) %>% print() # apply to subset of data table
The power and benefit of the purrr::map() family become evident when considering the diversity of some other members:
- purrr::map2(.x, .y, .f, ...) is used to iterate over two lists (.x and .y) at the same time. This means it applies the function to pairs of arguments.
- purrr::pmap(.l, .f) is used to iterate over a master list, i.e. a list of lists or list of vectors (.l). In particular, this functional applies the function .f to parallel sets of inner list elements inside the master list. Instead of using `.x` or `.y` this functional uses the sublist names as arguments.
- purrr::invoke_map(.f, .x, ...) invokes a list of functions .f for an argument list .x that is either of length 1 or of the same length as .f. Note that this function is deprecated in recent purrr releases.
v.rxr <- tbm %>% select(USD_EQ_XR, EUR_EQ_XR, JPY_EQ_XR) %>%
  pmap_dbl(~..1 - 0.5 * (..2 + ..3)) # apply mapper with specific role of each series in expression
str(v.rxr)
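The pairing logic of map2() and the name matching of pmap() can be checked on small made-up inputs (the names w, params, base and power are illustrative only):

```r
library(purrr)

# map2: the function receives pairs (.x, .y) taken from the two inputs
w <- map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x * .y)

# pmap: the names of the sublists are matched to the arguments of the function
params <- list(base = c(2, 3), power = c(3, 2))
p <- pmap_dbl(params, function(base, power) base^power)

print(w)  # 1*10, 2*20, 3*30
print(p)  # 2^3, 3^2
```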
broom package
The broom package addresses the model representation problem in R. Mathematical models have shared notation and community standards. By contrast, trained R models follow few community standards. Information on trained R models is collected in lists of different shapes. Extracting specific information across models is therefore not straightforward. For example, the class probabilities of R models typically have different interfaces.
The broom package provides a standardized way of representing trained models. In particular, it delivers three S3 generics to extract trained model information:
- broom::tidy() returns a standardized tibble with consistent column names that provides information about fitted coefficients;
- broom::glance() always returns a one-row tibble with consistent column names of goodness-of-fit measures;
- broom::augment() adds information about observations to a dataset (such as fitted values and standard errors of predictions) and can also produce predictions on new data.
The output of the above broom functions is always a tibble. Also, it never has rownames (which would prevent dataframes with duplicates from being stackable). Most importantly, column names are kept consistent, so that they can be combined across different models. These features make sure that model output is "tidy", i.e. easily usable for further work, and also stackable.
While broom is useful for summarizing the result of a single analysis in a consistent format, it is really designed for high-throughput applications that combine results from multiple analyses.
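A minimal sketch of broom::augment(), assuming a simple regression on the built-in mtcars data set rather than the notebook's simulated data (the names fit, aug_in and aug_new are illustrative):

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)  # simple regression on a built-in data set

aug_in <- augment(fit)                                            # in-sample: adds .fitted, .resid and diagnostics
aug_new <- augment(fit, newdata = data.frame(wt = c(2.5, 3.5)))   # predictions for new observations

print(head(aug_in, 3))
print(aug_new)
```

Columns added by augment() start with a dot, which avoids clashes with the names of the original variables.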
The function broom::tidy(x, ...) summarizes the results of an estimation or test in a tidy tibble, where each row contains a category of information. For regression models, these results typically are information sets related to coefficients. Standardized representation is particularly useful for joins with other similar estimations, as well as for quick inspections and custom visualizations.
mod.ols <- lm(USD_EQ_XR ~ USD_FACT1 + USD_FACT2 + USD_FACT3, tbm)
tbl.olt <- broom::tidy(mod.ols) # standard tibble for linear coefficient estimation
print(tbl.olt)
mod.nls <- nls(USD_EQ_XR ~ c + b1 * USD_FACT1 + b2/USD_FACT2 + b3/USD_FACT3, tbm, start = list(c = 0, b1 = 1, b2 = 1, b3 = 1))
tbl.nlt <- broom::tidy(mod.nls) # standardized tibble for non-linear coefficient estimation
print(tbl.nlt)
The function broom::glance() returns a tibble with exactly one row that contains model-specific goodness-of-fit measures and related statistics. This is useful to check for model misspecification and to compare a variety of models.
Goodness-of-fit measures differ across models but their names are standardized so that comparisons can focus on the common ones.
map_lm <- as_mapper(~lm(as.formula(str_c("USD_EQ_XR ~ ", str_c("USD_FACT", 1:.x) %>% str_c(collapse = " + "))), tbm))
l.mods <- map(1:9, map_lm) # creates list of linear regression models with incremental factor number
tbl.cmp <- purrr::map_df(l.mods, broom::glance, .id = "model") %>% arrange(AIC) # ranked tibble of model fits
print(tbl.cmp)
lubridate package
The lubridate package facilitates work and coding with date and date-time objects. In particular, it supports the parsing of date-times (i.e. converting strings or numbers into proper date-time objects), getting and setting components of date-time objects, rounding date-times and, in particular, mathematical operations with date-times, based on consistent timelines. This means operations are robust to time zones, leap days, daylight savings times, and other time-related idiosyncrasies.
Periods track how much the clock moves forward, regardless of how much time actually passes. They ignore timeline irregularities. One adds or subtracts periods in order to create records at specific clock times, irrespective of how much time has actually passed. Lubridate makes periods with functions that bear the name of the time unit pluralized, such as years(x = 1), months(x) or weeks(x). More generally, lubridate::period(num, units) is an automation-friendly period constructor that can handle various types of periods.
l.units <- list(10, 20, 30)
l.freqs <- list("months", "weeks", "days")
per <- map2(l.units, l.freqs, ~period(.x, units = .y)) %>% reduce(`+`) %>% print() # add up list of different period types
print(lubridate::ymd("2020-06-08") + 2 * per) # adding periods to a date gives a new date
Durations track the passage of physical time, which deviates from clock time when irregularities occur. One usually adds or subtracts durations to model physical processes, like battery life, but they also matter for trading rules and the assessment of the impact of economic-financial shocks on price dynamics. Durations are stored as seconds, which is the only time unit with consistent length. The difftime objects in base R are a form of duration.
Lubridate makes durations with functions that bear the name of the time unit prefixed with a d, such as lubridate::dhours(), lubridate::ddays(), or lubridate::dyears(). All of these are considered to have a fixed number of seconds. Calendar months cannot be standardized exactly; recent lubridate versions nonetheless offer lubridate::dmonths(), which is based on an average month length.
print(ymd("2020-01-01") + years(1)) # adding one year period
print(ymd("2020-01-01") + dyears(1)) # adding one year duration (in leap year does not reach the next year)
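The difference between periods and durations also shows up around daylight saving changes. The sketch below assumes the time zone "America/New_York", where clocks moved forward on 8 March 2020; the variable names are illustrative.

```r
library(lubridate)

# noon on the day before the clocks move forward
t0 <- ymd_hms("2020-03-07 12:00:00", tz = "America/New_York")

t_period <- t0 + days(1)     # period: same clock time on the next day
t_duration <- t0 + ddays(1)  # duration: exactly 86400 seconds later

print(t_period)
print(t_duration)
```

The period lands on noon of the next day, while the duration lands one clock hour later, because only 23 hours of physical time separate the two noons.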
Intervals represent clock time passage (like periods) on a particular part of the timeline. Unlike periods, intervals are defined by specific start and end times, rather than an amount of time. If summer time sets clocks forward, an interval of one hour arises, even if no time has passed. Technically, intervals represent the spans between boundaries.
Intervals can be created with the lubridate::interval(dt1, dt2) function or the dt1 %--% dt2 operator, both of which take date-time objects as arguments. Note that the interval between two units is one unit. For example, in the case of days the function counts from and to the same point within a day.
itv1 <- interval(ymd("2019-01-01"), ymd("2020-01-01"))
str(itv1) # str() prints the structure itself, so no print() is needed
itv2 <- dmy("1 Jan 2020") %--% dmy("1 Jan 2021")
print(itv2)
There is a range of useful operations that can be performed with intervals:
- The dt %within% itv operator checks if a date (or interval) falls within an interval (useful for blacklisting);
- lubridate::int_start(int) and lubridate::int_end(int) get and set first and last dates;
- lubridate::int_length(int) gives the interval length in seconds;
- lubridate::int_aligns(int1, int2) and lubridate::int_overlaps(int1, int2) check if intervals have a common boundary or overlap;
- lubridate::int_shift(int, by) shifts an interval along the timeline by a period or duration;
- lubridate::int_diff(times) turns the date-times in a vector into intervals.
itv <- interval(ymd("2019-01-01"), ymd("2020-01-01"))
ymd("2019-02-02") %within% itv %>% print # check if date falls into interval
interval(ymd("2019-09-01"), ymd("2020-05-01")) %within% itv %>% print # envelopment needs to be complete to give TRUE
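Some of the other interval operations can be sketched on made-up dates (the names a, b, overlap, one_day and a_shift are illustrative only):

```r
library(lubridate)

a <- interval(ymd("2019-01-01"), ymd("2019-07-01"))
b <- ymd("2019-06-01") %--% ymd("2019-12-31")

overlap <- int_overlaps(a, b)  # the two intervals share the month of June
one_day <- int_length(interval(ymd("2019-01-01"), ymd("2019-01-02")))  # length in seconds
a_shift <- int_shift(a, by = days(31))  # move the interval forward by a 31-day period

print(overlap)
print(one_day)
print(a_shift)
```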
stringr package
The stringr package makes working with strings in R easier, particularly when this work involves regular expressions (regex). The package functions facilitate [1] detecting string pattern matches, [2] subsetting character vectors, [3] managing the length of strings, [4] changing strings by rules, [5] splitting and joining strings, and [6] ordering strings. This is a great benefit for workflows with financial and economic data structures, where time series are created in groups with references to their names across countries or markets.
The functions of the stringr package have the following consistent features:
- Function names begin with str_.
- Pattern-matching functions take a pattern argument that accepts regex.
The function stringr::str_detect(cv, pattern) checks if the strings in the character vector cv contain a specific pattern. It returns a logical vector of the same length as the input character vector, with TRUE for elements that contain the pattern and FALSE otherwise. Analogously, stringr::str_which(cv, pattern) returns the indexes of strings that match the pattern. These functions are particularly powerful with regular expressions. However, if a regular expression is not needed, the stringr::fixed() function specifies that a pattern is a fixed string. This can yield substantial speed-ups. Detecting pattern matches is highly useful when operating on data series in larger data sets by name.
str_which(names(tbm), "(^EUR|^USD).+XR$") %>% print() # indexes of strings beginning with EUR/USD that contain XR at the end.
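A short sketch on a made-up vector of series names (nms is illustrative and merely mimics the naming convention of this notebook) contrasts regex patterns with fixed strings:

```r
library(stringr)

nms <- c("EUR_EQ_XR", "USD_EQ_XR", "USD_FACT1", "JPY_FACT1")  # hypothetical series names

is_usd <- str_detect(nms, "^USD")        # regex: name starts with USD
ix_eq <- str_which(nms, fixed("_EQ_"))   # fixed string: no regex interpretation
facts <- str_subset(nms, "FACT\\d$")     # regex: FACT followed by a single digit at the end

print(is_usd)
print(ix_eq)
print(facts)
```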
tidymodels packages
Tidymodels packages provide tidyverse-friendly interfaces for model implementation. They have taken the role of the tidyverse’s machine learning toolkit. The tidymodels packages do not implement statistical models themselves. Rather they focus on facilitating the surrounding tasks:
- data splitting and resampling (rsample),
- data pre-processing (recipes),
- model specification through a standardized interface (parsnip),
- hyperparameter tuning (tune and dials),
- performance evaluation (yardstick), and
- bundling of pre-processing and modeling (workflows).
These critical functions are usually performed in sequence and give an efficient standardized workflow.
The rsample package is useful for both random and time (ordered) splits.
- rsample::initial_split(data, prop = 3/4, ...) creates a single binary random split of the data into a training set and testing set.
- rsample::initial_time_split(data, prop = 3/4, lag = 0, ...) does the same, but takes the first proportion of the sample for training, instead of a random selection.
- rsample::vfold_cv() randomly splits the data into folds of roughly equal size for cross-validation.
- rsample::rolling_origin(data, initial = 5, assess = 1, cumulative = TRUE, skip = 0, lag = 0, ...) resamples consecutively across time.
All of the above give objects that denote splits: a single split (rsplit) for the initial split functions and a collection of splits (rset) for the resampling functions. An initial split can be used after being passed as argument to the rsample::training() and rsample::testing() functions. These functions extract the actual training and testing samples.
A popular standard of data partitioning is an initial split into (broad) training and test sets and a subsequent split of the broad training set into various (core) training and validation folds. The latter typically serves model tuning, i.e. the selection of the model with the best tuning parameters.
tbm.usd <- tbm %>% select(starts_with("USD"))
rset.in <- rsample::initial_split(tbm.usd, prop = .7) # initial 70%/30% split object
tbl.trn <- rsample::training(rset.in) # broad training set
tbl.tst <- rsample::testing(rset.in) # test set
rset.cv <- rsample::vfold_cv(tbl.trn) # cross-validation folds of the broad training set
rset.cv %>% print() # for illustration
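For time-ordered data the random splits above can be replaced by ordered ones. The following sketch assumes a made-up data frame df with 20 consecutive observations; all names are illustrative.

```r
library(rsample)

df <- data.frame(t = 1:20, y = (1:20) / 10)  # hypothetical ordered data

sp <- initial_time_split(df, prop = 0.75)    # first 75% for training, rest for testing
trn <- training(sp)
tst <- testing(sp)

# expanding training window of at least 10 observations, 2 observations for assessment
ro <- rolling_origin(df, initial = 10, assess = 2, cumulative = TRUE)
first_analysis <- analysis(ro$splits[[1]])    # training part of the first resample
first_assess <- assessment(ro$splits[[1]])    # validation part of the first resample

print(range(trn$t))
print(range(tst$t))
print(ro)
```

The individual resamples of a rolling-origin object are accessed with analysis() and assessment(), which play the roles that training() and testing() play for the initial split.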
The recipes package provides an interface that facilitates data pre-processing for a specific estimation formula. It specifies the role of each variable as an outcome or predictor variable. For intuition, the package functions are named after cooking actions.
The function recipe(data, form, ...) creates a recipe object based principally on the model formula form. The initial recipe object becomes the basis for subsequent transformations. Each data transformation is called a step. The package offers functions that correspond to specific types of steps, such as normalization or elimination of highly correlated variables, each prefixed with step_. There are nearly one hundred step functions available. Information on the scope of available step functions can be found in the recipes package documentation.
A step can be applied to a specific variable, groups of variables, or all variables. For the purpose of selection, the recipes package offers special selector functions. Thus, has_role(), all_predictors(), and all_outcomes() select variables according to their role in the estimation formula. Similarly, has_type(), all_numeric(), and all_nominal() are used to select columns based on data type. Moreover, select helpers from the tidyselect package can be used, just as in dplyr::select(), including starts_with(), contains() and so forth.
After specifying the transformations with step functions, prep() executes the transformations on top of the data that is supplied. This produces a modified recipe whose step objects have been updated with the required quantities.
rec.usd <- recipes::recipe(USD_EQ_XR ~ ., data = tbl.trn) %>%
  recipes::step_corr(all_predictors(), threshold = 0.9) %>% # correlation filter on predictors
  recipes::step_center(all_predictors()) %>% # centering
  recipes::step_scale(all_predictors()) # scaling to SD 1
print(rec.usd)
After a recipe object has been created and trained on a training data set, testing data can be transformed in exactly the same way, but out of sample, by using the bake() function. For a recipe with at least one pre-processing operation that has been trained by prep(), the computations are applied to the new data.
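The interplay of prep() and bake() can be sketched on two tiny made-up data frames (train, test and the recipe names are illustrative). The test data are standardized with the mean and standard deviation of the training data, not with their own.

```r
library(dplyr)
library(recipes)

train <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))  # hypothetical training data
test <- data.frame(y = c(5, 6), x = c(50, 60))                 # hypothetical test data

rec <- recipe(y ~ x, data = train) %>%
  step_center(all_predictors()) %>%  # subtract the training mean
  step_scale(all_predictors())       # divide by the training standard deviation

rec_trained <- prep(rec, training = train)    # estimate mean and SD from the training data
baked <- bake(rec_trained, new_data = test)   # apply the trained steps to the test data
print(baked)
```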
The parsnip package focuses on standardizing model interfaces and return values. When using parsnip, one does not have to remember each interface and unique argument names. This facilitates coding across R packages. Information on the scope of supported models can be found in the parsnip package documentation.
A model type that typically requires tuning is the elastic net based on the glmnet package. In a tidyverse workflow one calls such a model through the parsnip interface, while leaving open the tuning parameters, i.e. the penalty for regressors and the weight of LASSO versus ridge regression.
mod.elnet <- parsnip::linear_reg(penalty = tune::tune(), mixture = tune::tune()) %>%
parsnip::set_engine("glmnet")
print(mod.elnet)
A workflow is an object that can bundle together pre-processing, modeling, and post-processing. For example, a recipe object and a parsnip model can be combined into a workflow.
wfl.elnet <- workflows::workflow() %>%
workflows::add_recipe(rec.usd) %>% # add formula-based recipe to workflow
workflows::add_model(mod.elnet) # add general model type to workflow
The goal of the tune package is to facilitate hyperparameter tuning for the tidymodels packages. It relies heavily on recipes, parsnip, and dials. Model parameter tuning is accomplished by training models with different specifications and testing their predictive success. By running a "grid" (various combinations) of parameter values we can build hundreds or thousands of models. Validation will then identify the "best" model, i.e. the model with optimized tuning parameters.
Here we use the tune::tune_grid() function based on the pre-set workflow and the cross-validation folds to obtain tuning results for a grid that is created automatically (here its size is set by an integer). The chosen metric for evaluation here is the root mean-squared error.
trs.elnet <- wfl.elnet %>% # create tuning result object
tune_grid(resamples = rset.cv, grid = 10, metrics = yardstick::metric_set(yardstick::rmse))
trs.elnet %>% tune::collect_metrics() %>% print() # view summary of tuning results.
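Instead of an integer, the grid argument of tune_grid() also accepts a data frame of candidate values. A minimal sketch of how such a grid might be built with the dials package follows; the name grid.elnet and the choice of three levels per parameter are assumptions for illustration.

```r
library(dials)

# regular grid: three candidate values for each of the two elastic net parameters
grid.elnet <- grid_regular(penalty(), mixture(), levels = 3)
print(grid.elnet)
```

The result is a tibble with one row per parameter combination, here 3 x 3 = 9 rows, with columns named after the tuning parameters.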
A tibble with the optimized parameters can be extracted through tune::select_best().
tbl.ops <- trs.elnet %>% tune::select_best(metric = "rmse") # select best specification
print(tbl.ops) # tibble with optimized parameters
With tune::finalize_workflow() and parsnip::fit() one can then refit the optimally tuned model based on the whole broad training sample.
wfl.reg <- wfl.elnet %>%
tune::finalize_workflow(tbl.ops) %>% # attach the best tuning parameters to the model
parsnip::fit(data = tbl.trn) # fit the final model to the training data
The refitted optimal model can then be applied to the test set, allowing a validation of the overall learning method out of sample.
The tidymodels package yardstick provides tidy characterizations of model performance. The yardstick::metrics(data, truth, estimate, ...) function produces common performance measures of a model. It will automatically choose metrics appropriate for the given type of model. The function expects a tibble with columns that contain the actual results (truth) and what the model predicted (estimate).
wfl.reg %>%
predict(new_data = tbl.tst) %>% # predict test set
bind_cols(tbl.tst, .) %>% # combine with actual
select(USD_EQ_XR, .pred) %>%
yardstick::metrics(USD_EQ_XR, .pred) %>% # validate
print()
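What yardstick::metrics() returns for a numeric outcome can be checked by hand on a tiny made-up example (the data frame res and its column names are illustrative):

```r
library(yardstick)

# hypothetical actual values and predictions
res <- data.frame(truth = c(1, 2, 3, 4), pred = c(1.5, 2, 2.5, 4))

m <- metrics(res, truth = truth, estimate = pred)  # default regression metrics
print(m)
```

The prediction errors are -0.5, 0, 0.5 and 0, so the mean squared error is 0.125 and the mean absolute error is 0.25; the tibble reports the root mean-squared error, the R-squared and the mean absolute error in its .estimate column.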