R tidyverse for macro trading research

Packages and simulated data

This notebook uses tidyverse and other related packages and simulated example data for (equity index) returns and local return factors.

In [29]:
cv.packs <- c("tidyverse", "lubridate", "xts", "rsample", "recipes", "parsnip", "yardstick", "workflows", "tune", "dials")
for (pack in cv.packs) require(pack, character.only = TRUE)
In [30]:
set.seed(110)
dv.meds <- seq(as.Date("2010-02-01"), length = 126, by = "1 month") - 1  # date vector of month-end dates
cv.cas <- c("EUR", "USD", "JPY", "CAD", "GBP", "AUD")  # character vector of currency area names
cv.fac <- str_c("FACT", 1:9) # character vector of factor names
cv.alf <- outer(cv.cas, cv.fac, str_c, sep = "_") %>% as.vector() %>% sort()  # character vector of all factor series names
nm <- length(dv.meds)
nc <- length(cv.cas)
na <- length(cv.alf)
tbl.fac <- matrix(rnorm(nm * na, 0, 1), nm, na) %>% 
  as_tibble %>% set_names(cv.alf)
tbl.eqr <- matrix(NA, nm, nc) %>% as_tibble %>% set_names(str_c(cv.cas, "_EQ_XR"))
for (ca in cv.cas) {  # simulate equity returns related to local factors
  tbl.eqr[str_c(ca, "_EQ_XR")] <- as.matrix(tbl.fac[str_c(ca, cv.fac, sep = "_")]) %*% rnorm(9, 0, 1) + rnorm(nm, 0.5, 5)
}
tbm <- bind_cols(as_tibble(dv.meds) %>% set_names("date"), tbl.eqr, tbl.fac)  # simulated tibble data set
tbm %>% print()
# A tibble: 126 x 61
   date       EUR_EQ_XR USD_EQ_XR JPY_EQ_XR CAD_EQ_XR GBP_EQ_XR AUD_EQ_XR
   <date>         <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 2010-01-31    11.7        5.27     -5.81     -3.05    -4.27    -10.2  
 2 2010-02-28     4.02       1.34     -2.24     -9.71     3.19     -5.81 
 3 2010-03-31    -4.53      -4.42      3.05     16.7     -2.75     -0.633
 4 2010-04-30     7.11       1.62      7.11     -2.11     5.73      6.42 
 5 2010-05-31     3.49       8.43     -3.53    -17.2     11.3      -3.76 
 6 2010-06-30     6.48       1.34     -6.57     -5.03    -0.588     4.79 
 7 2010-07-31     1.98      12.5      -3.70      5.85   -10.1       3.93 
 8 2010-08-31     0.332     -9.88      6.15      3.85     5.73      6.40 
 9 2010-09-30    -2.37       4.97      3.90      8.35     9.17     -4.69 
10 2010-10-31    -0.445      7.50     -2.75     -2.92    -0.414    11.3  
# ... with 116 more rows, and 54 more variables: AUD_FACT1 <dbl>,
#   AUD_FACT2 <dbl>, AUD_FACT3 <dbl>, AUD_FACT4 <dbl>, AUD_FACT5 <dbl>,
#   AUD_FACT6 <dbl>, AUD_FACT7 <dbl>, AUD_FACT8 <dbl>, AUD_FACT9 <dbl>,
#   CAD_FACT1 <dbl>, CAD_FACT2 <dbl>, CAD_FACT3 <dbl>, CAD_FACT4 <dbl>,
#   CAD_FACT5 <dbl>, CAD_FACT6 <dbl>, CAD_FACT7 <dbl>, CAD_FACT8 <dbl>,
#   CAD_FACT9 <dbl>, EUR_FACT1 <dbl>, EUR_FACT2 <dbl>, EUR_FACT3 <dbl>,
#   EUR_FACT4 <dbl>, EUR_FACT5 <dbl>, EUR_FACT6 <dbl>, EUR_FACT7 <dbl>,
#   EUR_FACT8 <dbl>, EUR_FACT9 <dbl>, GBP_FACT1 <dbl>, GBP_FACT2 <dbl>,
#   GBP_FACT3 <dbl>, GBP_FACT4 <dbl>, GBP_FACT5 <dbl>, GBP_FACT6 <dbl>,
#   GBP_FACT7 <dbl>, GBP_FACT8 <dbl>, GBP_FACT9 <dbl>, JPY_FACT1 <dbl>,
#   JPY_FACT2 <dbl>, JPY_FACT3 <dbl>, JPY_FACT4 <dbl>, JPY_FACT5 <dbl>,
#   JPY_FACT6 <dbl>, JPY_FACT7 <dbl>, JPY_FACT8 <dbl>, JPY_FACT9 <dbl>,
#   USD_FACT1 <dbl>, USD_FACT2 <dbl>, USD_FACT3 <dbl>, USD_FACT4 <dbl>,
#   USD_FACT5 <dbl>, USD_FACT6 <dbl>, USD_FACT7 <dbl>, USD_FACT8 <dbl>,
#   USD_FACT9 <dbl>

The tidyr package

General benefits

The purpose of the tidyr package is to make a dataset tidy, organizing it in a standardized form. Standardization facilitates subsequent operations, estimation and analysis, particularly for the other packages of the "tidyverse", such as dplyr or ggplot2. A tidy data set is data table that meets the following conditions:

  • Each column represents one variable, whose values belong to a single attribute across observational units.
  • Each row represents one observation of these variables; row values all are attributes of this observation.
  • There is only one type of observational units (e.g. years, months, cross-sections) per table. If one wishes to investigate data sets with different observational units, for example times series and panels, one should use different tables.
    Examples of "messy data" are variables that are stored in both rows and columns and column headers that are values rather than variables.

Example: Pivoting data tables

Pivoting makes tables longer and wider, by combining or splitting the contents of columns. In the context of macroeconomic and financial analysis this serves common tasks, such as converting wide data formats (that are akin to spreadsheets and used in xts/zoo objects) to long data formats (as prevalent in SQL) and vice versa.

In [3]:
tbm.lon <- tbm %>% tidyr::pivot_longer(-date, names_to = "rtype", values_to = "rval")  # create "long" data frame
print(tbm.lon)
# A tibble: 7,560 x 3
   date       rtype        rval
   <date>     <chr>       <dbl>
 1 2010-01-31 EUR_EQ_XR  11.7  
 2 2010-01-31 USD_EQ_XR   5.27 
 3 2010-01-31 JPY_EQ_XR  -5.81 
 4 2010-01-31 CAD_EQ_XR  -3.05 
 5 2010-01-31 GBP_EQ_XR  -4.27 
 6 2010-01-31 AUD_EQ_XR -10.2  
 7 2010-01-31 AUD_FACT1   0.291
 8 2010-01-31 AUD_FACT2  -0.490
 9 2010-01-31 AUD_FACT3  -0.204
10 2010-01-31 AUD_FACT4  -0.716
# ... with 7,550 more rows
In [4]:
tbm.wid <- tbm.lon %>% tidyr::pivot_wider(names_from = "rtype", values_from = "rval")  # long format to wide
xdm.wid <- xts(tbm.wid %>% select(-date), tbm.wid$date)  # convert to xts
str(xdm.wid) %>% print()
An 'xts' object on 2010-01-31/2020-06-30 containing:
  Data: num [1:126, 1:60] 11.73 4.02 -4.53 7.11 3.49 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:60] "EUR_EQ_XR" "USD_EQ_XR" "JPY_EQ_XR" "CAD_EQ_XR" ...
  Indexed by objects of class: [Date] TZ: UTC
  xts Attributes:  
 NULL
NULL

Example: Nesting data tables

Nesting creates a list-column of data frames. Nesting is implicitly a summarising operation: it produces one row for each group defined by the non-nested columns. This is useful in conjunction with other summaries that work with whole datasets, most notably models.

The function tidyr::nest(.data, ...) takes a dataframe (.data) and a tidy-selected group of columns to nest (...), in the form new_col = c(col1, col2,...). The right hand side can be any tidy select expression.

In [5]:
tbm.eqr <- tbm %>%
    mutate(year = year(date)) %>% 
    nest(data = -year) %>%  # data list column that contains one data tibble per year
    tail(5) %>% print()
# A tibble: 5 x 2
   year data              
  <dbl> <list>            
1  2016 <tibble [12 x 61]>
2  2017 <tibble [12 x 61]>
3  2018 <tibble [12 x 61]>
4  2019 <tibble [12 x 61]>
5  2020 <tibble [6 x 61]> 

The dplyr package

General benefits

The dplyr package is one of the key workhorse modules in R. Its purpose is to transform data tables efficiently. Transformations include [1] grouping and summarizing cases (rows), [2] extracting, arranging and adding rows, [3] extracting, changing and adding new variables (columns), and [4] combining tables through "mutating joins" or simple row bindings.

The basic object used by the dplyr package is the tidy tibble. Financial and economic time series data frames need to be converted to this class. Importantly, tidy data do not use row names. Information stored in row names, such as dates in zoo and xts objects, should be be added as an explicit variable in the tibble.

Example: Grouped summaries

The grouping and summarising functions of dplyr allow swift construction of tables of grouped summary statistics, such as mean, standard deviations, benchmark correlation and so forth. Such tables combine observed cases by a grouping criterion (such as a time period, a particular market state or value of an economic trend) and display the summary statistics' value of a chosen group of variables (for example asset class returns).

The benefit of dplyr for this purpose are [1] simple human readable code, [3] a wide range of grouping criteria based on any of the columns in the tibble, and [3] easy selection of variables (columns) by use of dplyr::select() or dplyr::across() in conjunction with helper functions. Data selection can be based on unquoted data variables, character vectors or regular expressions. Moreover, the dplyr::arrange() function makes in easy to rank columns on the summary table.

In [6]:
tbm %>% group_by(year(date)) %>%  # actual grouping function
    select(date | ends_with("EQ_XR")) %>% 
    summarise_if(is.numeric, median, na.rm = T)  %>%   # median of all numeric columns is calculated by year group
    slice_tail(n = 3) %>% print()
Adding missing grouping variables: `year(date)`

# A tibble: 3 x 7
  `year(date)` EUR_EQ_XR USD_EQ_XR JPY_EQ_XR CAD_EQ_XR GBP_EQ_XR AUD_EQ_XR
         <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1         2018      4.96    -0.222     -1.54    -0.149     2.19      1.41 
2         2019     -1.66    -1.65       1.26    -0.647     2.91     -1.89 
3         2020     -4.76     4.06       2.90    -2.53     -0.630     0.334

Summaries can also be two-dimensional, i.e. reach over column and row groups. The function dplyr::rowwise() allows to compute on a data frame one row at a time. One can use it to display summaries or perform mutations “by row”. This can make it particularly useful in dataframes with list-columns. Like dplyr::group_by() the function dplyr::rowwise() does not change what the data looks like; it changes how dplyr verbs operate on the data.

In [7]:
tbm %>% mutate(year = year(date)) %>%
    rowwise() %>%  # creates modified copy of tibble
    mutate(USD_BIGFACT = mean(USD_FACT1:USD_FACT9))  %>% # calculation across rows
    select(year | date | USD_BIGFACT) %>% 
    group_by(year) %>%  # actual grouping function
    summarise(USD_BIGFACT = mean(USD_BIGFACT, na.rm = T), .groups = "drop")  %>% 
    slice_tail(n = 5) %>%  print()
# A tibble: 5 x 2
   year USD_BIGFACT
  <dbl>       <dbl>
1  2016     -0.120 
2  2017     -0.0709
3  2018      0.0310
4  2019      0.0173
5  2020     -0.120 

Example: Columnwise operations

The dplyr functions allow transforming groups or panels of variables in one single operation with transparanet code.

Columnwise operations with the dplyr::across(.cols,.fns) function perform one and the same operation on multiple columns of a table. These operations are particularly important inside summarise() and mutate() and now supersede scoped functions (functions that end on _if, _at or _all). Columnwise opeprations allow summarising or mutating data panels in one go. Since one can apply a whole list of functions to a panel in one go it is a very powerful method for performing many operations with a few short lines of code.

  • The first argument, .cols, selects the columns to operate on. It uses tidy selection (like select()) to pick variables by position, name, and type. One can also use a character vector, for example in conjunction with dplyr::all_of().
  • The second argument, .fns, is a function or list of functions to apply to each column. The argument also accepts a purrr style formula (or list of formulas) like ~ .x / 2.

Reference: https://dplyr.tidyverse.org/articles/colwise.html

In [8]:
tbm  %>% summarise(across(USD_FACT1:USD_FACT9, ~mean(.x, na.rm = TRUE)))  %>%  print()  # apply function across columns
# A tibble: 1 x 9
  USD_FACT1 USD_FACT2 USD_FACT3 USD_FACT4 USD_FACT5 USD_FACT6 USD_FACT7
      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1    0.0370   -0.0152    0.0576    -0.184    -0.134    0.0605   -0.0208
# ... with 2 more variables: USD_FACT8 <dbl>, USD_FACT9 <dbl>
In [9]:
l.fs  <- list(sign = ~sign(.x), abs = ~abs(.x))  # list of functions
tbm.mod  <- tbm  %>%
    transmute(across(c(EUR_EQ_XR, USD_EQ_XR), l.fs))  # apply list of functions over muliple columns
print(slice_tail(tbm.mod, n = 5))
# A tibble: 5 x 4
  EUR_EQ_XR_sign EUR_EQ_XR_abs USD_EQ_XR_sign USD_EQ_XR_abs
           <dbl>         <dbl>          <dbl>         <dbl>
1              1         15.3               1          2.33
2             -1          6.50              1          5.79
3             -1          4.55              1          6.86
4             -1          4.98             -1          7.30
5             -1          1.97              1          7.14

The purrr package

General benefits

The tidyverse package purrr enhances R’s functional programming (FP) toolkit. It offers a complete and consistent set of tools for applying functions to data structures, working with lists in general, reducing lists, and modifying function behaviour.

Example: Applying (potentially complex) operations to selections of the dataset

The family of map() functions allows to replace many standard loops with code that is more succinct and easier to read. It generalizes the application of all sorts of operations to sets of variables in a dataframe. The map function family is similar to the apply function family in base R, but offers more variation and is more consistent in the format of output values. In general, purrr::map() returns a list or data frame. However, type of object that is returned can be determined by choosing map functions with appropriate suffix.

The simplest function purrr::map(.x, .f,...) transforms its input by applying a function (.f) to each element of a list, dataframe or atomic vector (.x) and returning an object of the same length as the input. Put simple, the function `.f` iterates over each element of the list `.x`. The map functions also use the ... (“dot dot dot”) argument to pass along additional arguments to .f at each interative call. Multiple arguments can be passed along using commas to separate them.

Since dataframes are lists of vectors, the purrr::map() function is suitable for operating on dataframe columns. Thus, the function can be applied to selected columns.

In [10]:
tbm %>% select(starts_with(c("EUR", "USD")) & ends_with("FACT1")) %>% map(summary)
$EUR_FACT1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.0523 -0.4359  0.1561  0.1808  0.7367  2.8435 

$USD_FACT1
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.41910 -0.61603 -0.03070  0.03696  0.72518  3.68923 

The standard purrr::map(x.=list, .f = function) syntax works often. However, for cases where one must specify how the list is used a more explicit version is required. An explicit syntax can be based on a one-sided function, called a mapper, and takes the form purrr::map(list, ~function(.x)). It uses the argument .x, ., or ..1 to denote where the list element goes inside the function.

Mappers are short-hand one-sided functions that are often used like lambda functions. However, mappers can also be named and reused, very much like regular functions, by using the purrr::as_mapper() function. Applied in an assignment statement this function gives reusable mappers.

In [11]:
returnLast12Months <- as_mapper(~sum(tail(.x, 12)))  # reusable mapper 
tbm %>% select(ends_with("EQ_XR")) %>% map_dbl(returnLast12Months) %>% print()  # apply to subset of data table
 EUR_EQ_XR  USD_EQ_XR  JPY_EQ_XR  CAD_EQ_XR  GBP_EQ_XR  AUD_EQ_XR 
-24.048168 -13.524282  46.657057 -31.661047  28.831413   1.910533 

The power and benefit of the purrr:map() family becomes evident when considering the diversity of some other members:

  • The functional purrr::map2(.x, .y, .f, ...) is used to iterate over two lists (.x, and .y) at the same time. This means it applies the function to pairs of arguments.
  • The functional purrr::pmap(.l, .f) is used to iterate over a master list, i.e. a list of lists or list of vectors (.l). In particular, thus functional applies the function .f. to parallel sets of inner list elements inside the masterlist. Instead of using `.x` or `.y` this functional uses the sublist names as arguments .
  • The functional purrr::invoke_map(.f, .x,...) invokes a list of functions .f for a argument list .x that is either of length 1 or of the same length as .f.
In [12]:
v.rxr <- tbm  %>% select(USD_EQ_XR, EUR_EQ_XR, JPY_EQ_XR) %>% 
    pmap_dbl(~..1 - 0.5 * (..2 + ..3))  # apply mapper with specific role of each series in expression
str(v.rxr)
 num [1:126] 2.306 0.446 -3.683 -5.494 8.456 ...

The broom package

General benefits

The broom package addresses the model representation problem in R. Mathematical models have shared notation and community standards. By contrast, trained R models follow few community standards. Information on trained R models is collected in lists of different shapes. Extracting specific information across models is therefore not straightforward. For example, R model class probabilities have typically different interfaces.

The broom package provides a standardized way of representing trained models. In particular, it delivers three S3 generics to extract trained model information:

  • broom::tidy() returns a standardized tibble with consistent column names that provides information about fited coefficients
  • broom::glance() always returns a one-row tibble with consistent column names of goodness of fit measures
  • broom::augment() adds information about observations to a dataset (such as fitted values and standard errors of predictions) and to get predictions on new data.

The output of the above broom functions is always a tibble. Also, it never has rownames (which would prevent dataframes with duplicates from being stackable). Most importantly, column names are kept consistent, so that they can be combined across different models. These features make sure that model output is "tidy", i.e. easily usable for further work, and also stackable.

While broom is useful for summarizing the result of a single analysis in a consistent format, it is really designed for high-"throughput" applications that combine results from multiple analyses.

Example: Model information in standardized tidy dataframes

The function broom::tidy(x,...) summarizes the results of an estimation or test in a tidy tibble, where each row contains a category of information. For regression models, these results typically are information sets related to coefficients. Standardized representation is particularly useful for joints with other similar estimations, as well as quick inspections and custom visualizations.

In [13]:
mod.ols <- lm(USD_EQ_XR ~ USD_FACT1 + USD_FACT2 + USD_FACT3, tbm)
tbl.olt <- broom::tidy(mod.ols)  # standard tibble for linear coefficient estimation
print(tbl.olt)
# A tibble: 4 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    0.320     0.501     0.639 0.524  
2 USD_FACT1      1.09      0.514     2.13  0.0353 
3 USD_FACT2     -1.45      0.534    -2.71  0.00760
4 USD_FACT3     -0.629     0.525    -1.20  0.233  
In [14]:
mod.nls <- nls(USD_EQ_XR ~ c + b1 * USD_FACT1 + b2/USD_FACT2 + b3/USD_FACT3, tbm, start = list(c = 0, b1 = 1, b2 = 1, b3 = 1))
tbl.nlt <- broom::tidy(mod.nls)  # standardized tibble for non-linear coefficient estimation
print(tbl.nlt)
# A tibble: 4 x 5
  term  estimate std.error statistic p.value
  <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 c       0.208     0.522      0.399  0.690 
2 b1      1.07      0.534      2.00   0.0477
3 b2      0.0421    0.0545     0.773  0.441 
4 b3     -0.0120    0.0133    -0.903  0.368 

Example: Comparable goodness-of-fit statistics in single-line tibbles

The function broom::glance() returns a tibble with exactly one row that contains model-specific goodness of fitness measures and related statistics. This is useful to check for model misspecification and to compare a variety models. Goodness-of-fit measures differ across models but their names are standardized so that comparisons can focus on the common ones.

In [15]:
map_lm <- as_mapper(~lm(as.formula(str_c("USD_EQ_XR ~ ", str_c("USD_FACT", 1:.x) %>% str_c(collapse = " + "))), tbm))
l.mods <- map(1:9, map_lm)  # creates list of linear regression models with incremental factor number
tbl.cmp <- purrr::map_df(l.mods, broom::glance, .id = "model")  %>% arrange(AIC)  # ranked tibble of model fits
print(tbl.cmp)
# A tibble: 9 x 13
  model r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
  <chr>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1 8        0.167         0.110   5.48      2.93 0.00508     8  -388.  797.  825.
2 2        0.0783        0.0633  5.62      5.23 0.00663     2  -395.  798.  809.
3 3        0.0890        0.0666  5.61      3.97 0.00965     3  -394.  798.  812.
4 9        0.169         0.105   5.50      2.63 0.00838     9  -388.  799.  830.
5 6        0.129         0.0846  5.56      2.92 0.0107      6  -391.  799.  821.
6 4        0.0991        0.0693  5.61      3.33 0.0127      4  -393.  799.  816.
7 5        0.106         0.0690  5.61      2.85 0.0180      5  -393.  800.  820.
8 7        0.129         0.0777  5.58      2.50 0.0196      7  -391.  801.  826.
9 1        0.0272        0.0194  5.75      3.47 0.0649      1  -398.  803.  811.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The lubridate package

General benefits

The lubridate package facilitates work and coding with data and date-time objects. In particular, it supports the parsing of date-times (i.e. converting strings or numbers into proper date-time objects), getting and setting components of date-time objects, rounding date-times and - in particular - mathematical operations with date-times, based on consistent timelines. This means operations are robust to time zones, leap days, daylight savings times, and other time-related idiosyncracies.

Example: Date-time maths with periods

Periods track how much the clock moves forward, regardless of what time actually passes. They ignore time line irregularities. One adds or subtracts periods in order to create records at specific clock times, irrespective of how much time has actually passed. Base R makes periods with functions that bear the neame of the time unit pluralized, such as years(x=1), months(x) or weeks(x). More generally lubridate::period(num, units) is an automation-friendly period constructor that can handle various types of periods.

In [16]:
l.units <- list(10, 20, 30)
l.freqs <- list("months", "weeks", "days")
per <- map2(l.units, l.freqs, ~period(.x, units = .y)) %>% reduce(`+`) %>% print()  # add up list of different period types
print(lubridate::ymd("2020-06-08") + 2 * per) # periods can be added to a date and get new date
[1] "10m 170d 0H 0M 0S"
[1] "2023-01-14"

Example: Date-time maths with durations

Durations track the passage of physical time, which deviates from clock time when irregularities occur. One usually adds or subtracts durations to model physical processes, like battery life, but they also matter for trading rule and the assessment of the impact of economic-financial shocks on price dynamics. Durations are stored as seconds, which is the only time unit with consistent length. The difftime objects in base R are a form of duration.

Lubridate makes periods with functions that bear the name of the period prefixed with a d, such as lubridate::dhours(), lubridate::ddays(), or lubridate::years(). All of these are considered to have a fixed number of seconds. There is no lubridate::dmonths(), because calendar months cannot be standardized.

In [17]:
print(ymd("2020-01-01") + years(1))  # adding one year period
print(ymd("2020-01-01") + dyears(1))  # adding one year duration (in leap year does not reach the next year)
[1] "2021-01-01"
[1] "2020-12-31 06:00:00 UTC"

Example: Date-time maths with intervals

Intervals represent clock time passage (like periods) on a particular part of the timeline. Unlike periods, intervals are defined by specific start and end times, rather than an amount of time. If summer time sets clocks forward an interval of one hour arises, even if no time has passed. Technically, intervals represent the spaces in boundaries.

Intervals can be created with the lubridate::interval(dt1, dt2) function or the dt1 %--% dt2 operator, both of which take date-time objects as arguments. Note that the interval between two units is one unit. For example, in the case of days the function counts from and to the same point within a day.

In [18]:
itv1 <- interval(ymd("2019-01-01"), ymd("2020-01-01"))
print(str(itv1))
itv2 <- dmy("1 Jan 2020") %--% dmy("1 Jan 2021")
print(itv2)
Formal class 'Interval' [package "lubridate"] with 3 slots
  ..@ .Data: num 31536000
  ..@ start: POSIXct[1:1], format: "2019-01-01"
  ..@ tzone: chr "UTC"
NULL
[1] 2020-01-01 UTC--2021-01-01 UTC

There is a range of useful operations that can be performed with intervals:

  • the dt %within% itv operator checks if a date (or interval) falls within an interval (useful for blacklisting);
  • lubridate::int_start(int) and lubridate::int_end(int) get and set first and last dates;
  • lubridate::int_length(int) gives the interval length in seconds;
  • lubridate::int_aligns(int1, int2), lubridate::int_overlaps(int1, int2) check if intervals have common boundary/overlap;
  • lubridate_shift(int, by) shift an interval along the timeline by an interval;
  • int_diff(times) turns the date-times in a vector into intervals;
In [19]:
itv <- interval(ymd("2019-01-01"), ymd("2020-01-01"))
ymd("2019-02-02") %within% itv  %>% print  # check if date falls into interval
interval(ymd("2019-09-01"), ymd("2020-05-01")) %within% itv  %>%  print  # envelopment needs to be complete to give TRUE
[1] TRUE
[1] FALSE

The stringr package

General benefits

The stringr package makes working with strings in R easier, particularly when this work involves regular expressions (regex). The package functions facilitate [1] detecting string pattern matches, [2] subsetting character vectors, [3] managing the length of strings, [4] changing strings by rules, [5] splitting and joining strings, and [6] ordering strings. This is a great benefit for workflows with financial and economic data structures, where time series are created in groups with references to their names across countries or markets.

The functions of the stringr package have the following consistent features:

  • All functions start with str_.
  • The first argument is a character vector.
  • Many functions have a pattern argument that accepts regex.

Example: Detecting matchings in character vectors

The function stringr::str_detect(cv, pattern) checks if the strings in the character vector cv contain a specific pattern. It returns a logical vector of the same length as the input character vector, with TRUE for elements that contain the pattern and FALSE otherwise. Analogously, stringr::str_which(cv, pattern) returns the indexes of strings that match the pattern. These functions are particularly powerful with regular expressions. However, if a regular expression is not needed, the stringr::fixed() function specifies that a pattern is a fixed string. This can yield substantial speed ups. Detecting pattern matches is highly useful in operating with data series in larger data sets by name.

In [20]:
str_which(names(tbm), "(^EUR|^USD).+XR$")  %>% print() # indexes of strings beginning with EUR/USD that contain XR at the end.
[1] 2 3

The tidymodels packages

General benefits

Tidymodels packages provide tidyverse-friendly interfaces for model implementation. They have taken the role of the tidyverse’s machine learning toolkit. The tidymodels packages do not implement statistical models themselves. Rather they focus on facilitating the surrounding tasks:

  • data sample splitting for validation (rsample),
  • data pre-processing (recipes),
  • standardized model training and out-of-sample predictions (parsnip),
  • creating and tuning hyperparameters(tune and dials),
  • model validation with appropriate metrics (yardstick), and
  • bundling processes (workflows).

These critical functions are usually performed in sequence and give an efficient standardized workflow.

Example: Data sample splitting

The rsample package is useful for both random and time (ordered) splits.

  • The function rsample::initial_split(data, prop = 3/4,...) creates a single binary random split of the data into a training set and testing set.
  • For time series rsample::initial_time_split(data, prop = 3/4, lag = 0, ...) does the same, but takes the first proportion of the sample for training, instead of a random selection.
  • Multi-fold cross-validation randomly splits the data into v groups of roughly equal size (called “folds”). The rsample::vfold_cv() function does such random splitting.
  • For time series rsample::rolling_origin(data, initial = 5, assess = 1, cumulative = TRUE, skip = 0, lag = 0,...) resamples consecutively across time.

All of the above give rset objects that denote splits. The splits can be used after being passed as arguments to rsample::training() and rsample::testing() functions. The functions extract actual training and testing samples.

A popular standard of data partioning is an initial split into (broad) training and test sets and a - subsequent - split of the broad training set into various (core) training and validation folds. The latter typically serves model tuning, i.e. the selection of the model with the best tuning parameters.

In [21]:
tbm.usd <- tbm %>% select(starts_with("USD"))
rset.in <- rsample::initial_split(tbm.usd, prop = .7)  # initial 70%/30% split object
tbl.trn <- rsample::training(rset.in)  # broad training set
tbl.tst <- rsample::testing(rset.in)  # test set
rset.cv <- rsample::vfold_cv(tbl.trn)  # cross-validation folds of the broad training set
rset.cv %>% print()  # for illustration
#  10-fold cross-validation 
# A tibble: 10 x 2
   splits         id    
   <list>         <chr> 
 1 <split [80/9]> Fold01
 2 <split [80/9]> Fold02
 3 <split [80/9]> Fold03
 4 <split [80/9]> Fold04
 5 <split [80/9]> Fold05
 6 <split [80/9]> Fold06
 7 <split [80/9]> Fold07
 8 <split [80/9]> Fold08
 9 <split [80/9]> Fold09
10 <split [81/8]> Fold10

Example: Formula-specific data pre-processing

The recipes package provides an interface that facilitates data pre-processing for a specific estimation formula. It specifies the role of each variable as an outcome or predictor variable. And, for intuition, the package functions are named after cooking actions.

The function recipe(data, form,...) creates a recipe object based principally on the model formula form. The initial recipe object becomes the basis for subsequent transformations. Each data transformation is called a step. The package offers functions that correspond to specific types of steps, such as normalization or elimination of highly correlated variables, each prefixed with step_. There are nearly one hundred step functions available. Information on the scope of available step functions can be found here.

A step can be applied to a specific variable, groups of variables, or all variables. For the purpose of selection, the recipes package offers special selector functions. Thus, has_role(), all_predictors(), and all_outcomes() select variables according to their role in the estimation formula. Similarly, has_type(), all_numeric(), and all_nominal() are used to select columns based on data type. Moreover, select helpers from the tidyselect package can be used, just as in dplyr::select(), including starts_with(), contains() and so forth.

After specifying the transformations with step functions, prep() executes the transformations on top of the data that is supplied. This produces a modified recipe whose step objects have been updated with the required quantities.

In [22]:
rec.usd <- recipes::recipe(USD_EQ_XR ~ ., data = tbl.trn) %>%
      recipes::step_corr(all_predictors(), threshold = 0.9) %>%  # correlation filter on predictors
      recipes::step_center(all_predictors()) %>%  # centering
      recipes::step_scale(all_predictors()) %>%  # scaling to SD 1
print(rec.usd)
Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor          9

Operations:

Correlation filter on all_predictors()
Centering for all_predictors()
Scaling for all_predictors()

After a recipe object has been created for a training data set, testing data can be transformed in exactly the same way but out of sample by using the bake() function. For a recipe with at least one pre-processing operation that has been trained by prep.recipe() computations are applied to the new data.

Example: Standardized model training, validation and testing

The parsnip package focuses on standardizing model interfaces and return values. When using parsnip, one does not have to remember each interface and unique argument names. This facilitates coding across R packages. Information on the scope of supported models can be found here.

As model type that typically requires tunining is elastic net based on the glmnet package. In a tidyverse workflow one calls such a model through the parsnip interface, while leaving open the tuning parameters, i.e the penalty for regressors and the weight of LASSO versus ridge regression.

In [23]:
mod.elnet <- parsnip::linear_reg(penalty = tune::tune(), mixture = tune::tune()) %>%
      parsnip::set_engine("glmnet")
print(mod.elnet)
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune::tune()
  mixture = tune::tune()

Computational engine: glmnet 

A workflow is an object that can bundle together pre-processing, modeling, and post-processing. For example, a recipe object and a parsnip model can be combined into a workflow.

In [24]:
wfl.elnet <- workflows::workflow() %>%
      workflows::add_recipe(rec.usd) %>%  # add formula-based recipe to workflow
      workflows::add_model(mod.elnet)  # add general model type to workflow

The goal of the tune package is to facilitate hyperparameter tuning for the tidymodels packages. It relies heavily on recipes, parsnip, and dials. Model parameter tuning is accomplished by training models with different specification and testing the predictive success. By running a "grid" (various combinations) of parameter values we can build hundreds or thousands of models. Validation will then identify the "best" model, i.e. the model with optimized tuning parameters.

Here we use the tune::tune_grid() function based on the pre-set workflow and the cross-validation folds to obtain tuning results for a grid that is being created (here by an integer). The chosen metric for evaluation here is the root mean-squared error.

In [25]:
trs.elnet <- wfl.elnet %>%  # create tuning result object
    tune_grid(resamples = rset.cv, grid = 10, metrics = yardstick::metric_set(yardstick::rmse))
trs.elnet %>% tune::collect_metrics() %>% print()  # view summary of tuning results.
# A tibble: 10 x 8
    penalty mixture .metric .estimator  mean     n std_err .config
      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
 1 1.47e- 4  0.0910 rmse    standard    5.99    10   0.516 Model01
 2 2.70e-10  0.146  rmse    standard    5.99    10   0.516 Model02
 3 1.50e- 5  0.262  rmse    standard    5.99    10   0.516 Model03
 4 2.55e- 1  0.353  rmse    standard    5.93    10   0.534 Model04
 5 3.21e- 8  0.497  rmse    standard    5.99    10   0.516 Model05
 6 1.24e- 3  0.609  rmse    standard    5.99    10   0.516 Model06
 7 3.37e- 6  0.630  rmse    standard    5.99    10   0.516 Model07
 8 1.56e- 9  0.764  rmse    standard    5.99    10   0.516 Model08
 9 6.73e- 2  0.826  rmse    standard    5.97    10   0.525 Model09
10 4.56e- 7  0.966  rmse    standard    5.99    10   0.516 Model10

A tibble with the optimized parameters can be extracted trough tune::select_best().

In [26]:
tbl.ops <- trs.elnet %>% tune::select_best(metric = "rmse") # select best specification
print(tbl.ops)  # tibble with optimized parameters
# A tibble: 1 x 3
  penalty mixture .config
    <dbl>   <dbl> <chr>  
1   0.255   0.353 Model04

With tune::finalize_workflow() and parsnip::fit() one can then refit the the optimally tuned model based on the whole broad training sample.

In [27]:
wfl.reg <- wfl.elnet %>%  
  tune::finalize_workflow(tbl.ops) %>%  # attach the best tuning parameters to the model
  parsnip::fit(data = tbl.trn)  # fit the final model to the training data

The refitted optimizal model can then be applied to the test set, allowing a validation of the overall learning method out of sample.

The tidymodels package yardstick provides tidy characterizations of model performance. The yardstick::metrics(data, truth, estimate,...) function produces common performance measures of a model. It will automatically choose metrics appropriate for the given type of model. The function expects a tibble with columns that contains the actual results (truth) and what the model predicted (estimate).

In [28]:
wfl.reg %>%
  predict(new_data = tbl.tst) %>%  # predict test set
  bind_cols(tbl.tst, .) %>%  # combine with actual 
  select(USD_EQ_XR, .pred) %>% 
  yardstick::metrics(USD_EQ_XR, .pred) %>%  # validate
  print()
# A tibble: 3 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      4.96  
2 rsq     standard      0.0966
3 mae     standard      3.90