Standard performance statistics are insufficient and potentially misleading for evaluating algorithmic trading strategies. Metrics based on prediction errors mistakenly assume that all errors matter equally. Metrics based on classification accuracy disregard the magnitudes of errors. And traditional performance ratios, such as Sharpe, Sortino and Calmar, are affected by factors outside the algorithm, such as asset class performance, and rely on the assumption of normally distributed returns. Therefore, a new paper proposes a discriminant ratio (‘D-ratio’) that measures an algorithm’s success in improving risk-adjusted returns versus a related buy-and-hold portfolio. Roughly speaking, the metric divides annual return by a value-at-risk metric that does not rely on normality and then divides the result by the same ratio for the buy-and-hold portfolio. The metric can be decomposed into the contributions of return enhancement and risk reduction.

Dessain, Jean, “Machine learning models predicting returns: why most popular performance metrics are misleading and proposal for an efficient metric.” 

The below quotes are from the paper. Headings, cursive text, and text in brackets have been added.

This post ties in with this site’s summary on quantitative methods for macro information efficiency, particularly the section on backtesting.

Popular algorithm performance metrics

“Artificial intelligence (AI) and its sub-fields of machine learning (ML), deep-learning (DL) and reinforcement learning (RL) have proven to be an attractive framework to…to predict asset returns…Our aim is to investigate…performance criteria used to evaluate AI models and to assess their efficiency for comparing these models with the pursued objective to improve the risk/return profile of investment strategies.”

“We reviewed 190 articles presenting either several ML and DL algorithms aiming at predicting future asset returns or RL algorithms proposing investment strategies. The performance metrics found in the analysed articles are very diverse…

  • Error-based metrics estimate the performance of an algorithm in measuring the error in prediction between the effective return computed ex-post and the value predicted by the algorithm. These metrics include mean squared error (MSE), mean absolute error (MAE) and evolutions thereof…
  • Accuracy-based metrics measure the accuracy of the class assigned by the algorithm to the predicted return compared to the class of the effective return computed ex-post. The classification can be binary with two classes (positive expected return vs negative expected return, or investment vs no investment) or more complex…These metrics are based on confusion matrices…and include…accuracy, F1, precision or recall.
    N.B.: Accuracy is the fraction of correctly classified samples. In the binary case, it is the ratio of true positives plus true negatives to all labelled cases, where ‘true’ means correctly labelled by the model. Precision is the ratio of true positives, i.e. correctly labelled positives, to all cases that are labelled as positives, so the denominator is the sum of true positives and false positives. It is a metric of avoidance of mistakes. Recall is the ratio of true positives to all cases that are actually positive, so the denominator is the sum of true positives and false negatives. It is a metric of avoidance of missing out. The F1 score blends the two: it is the harmonic mean of precision and recall, i.e. precision times recall divided by the average of precision and recall. The harmonic mean punishes extreme values.
  • Investment-based metrics measure the results derived from an investment strategy proposed by the algorithm with buy-hold-sell signals. These metrics can be subdivided into [two types].
    • Result-based metrics measure either the monetary results, the realized return or the risk supported to generate the return (volatility, maximum drawdown, etc.) but do not adjust one by the other.
    • Risk-adjusted return-based metrics…also referred to as risk/return-based metrics consider simultaneously the return and the risk of the investment strategy and measure how efficient the algorithm is to generate a return under the constraint of risk and to optimize the risk/return profile. Metrics primarily differ by the way they assess the risk. This class of metrics includes Sharpe, Sortino or Calmar ratios…”
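
As a minimal sketch of the accuracy-based metrics described in the N.B. above, the binary case can be computed directly from the four cells of the confusion matrix. The function name and the sample data are illustrative assumptions, not taken from the paper; here "positive" means a positive realised or predicted return.

```python
def classification_metrics(actual, predicted):
    """Binary metrics from two equal-length lists of booleans.

    True stands for the positive class (e.g. a positive return).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)              # true positives
    tn = sum((not a) and (not p) for a, p in pairs)  # true negatives
    fp = sum((not a) and p for a, p in pairs)        # false positives
    fn = sum(a and (not p) for a, p in pairs)        # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: was the realised / predicted return positive?
actual = [True, True, True, False, False, False, True, False]
predicted = [True, True, False, True, False, False, True, False]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

In this made-up sample there are three true positives, three true negatives, one false positive and one false negative, so all four metrics coincide at 0.75.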

Why popular performance metrics are misleading

“Error-based metrics are among the most popular ones with 187 occurrences in the 190 reviewed articles. Error-based metrics are used in any domain as soon as regressions are involved, but for the specific task considered, error-based metrics suffer from two severe weaknesses:

  • While they can easily be applied to regression algorithms, they are less applicable with classification algorithms and are inapplicable with reinforcement learning, making a comparison between several types of algorithms impossible
  • Error-based metrics will equally consider all errors and will not differentiate an error that triggers a bad decision (a mis-investment resulting in a negative return or a missed opportunity with no investment when the asset has led to a positive return) from an error that has no adverse consequence, leading to a positive return or to a non-investment that avoided a negative return…All errors are not equal; error-based algorithms do miss this critical element. Error-based metrics could lead to severe misevaluation of the performance of algorithms.”
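
The second weakness can be illustrated with a small sketch. The numbers are made up for illustration and are not from the paper: two hypothetical forecasts have the same mean squared error, yet a rule that invests only when the predicted return is positive makes money with one and loses money with the other.

```python
def mse(actual, predicted):
    """Mean squared error between realised and predicted returns."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def long_flat_pnl(actual, predicted):
    """Sum of realised returns on days where the predicted return is positive."""
    return sum(a for a, p in zip(actual, predicted) if p > 0)


# Hypothetical daily returns and two hypothetical forecasts.
actual = [0.01, -0.01, 0.02, -0.02]
forecast_a = [0.03, -0.03, 0.04, -0.04]  # always off by 0.02, sign always right
forecast_b = [-0.01, 0.01, 0.00, 0.00]   # always off by 0.02, sign wrong or neutral

print(mse(actual, forecast_a), mse(actual, forecast_b))  # equal up to rounding
print(long_flat_pnl(actual, forecast_a))  # positive: invests on both up days
print(long_flat_pnl(actual, forecast_b))  # negative: invests only on a down day
```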

“Accuracy-based metrics…focus on a different criterion: the right or wrong classification or the right or wrong investment decision. But accuracy-based metrics might miss the magnitude of the relative gain from a good decision versus the magnitude of a loss from a bad decision.”
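
A hedged toy example of the magnitude problem, again with invented numbers: a signal that is right on three days out of four can still lose money if the single wrong call coincides with a large loss.

```python
def hit_rate(actual, signals):
    """Share of days where the signal (True = invest) matches the return's sign."""
    hits = sum((a > 0) == s for a, s in zip(actual, signals))
    return hits / len(actual)


def total_return(actual, signals):
    """Compounded return from holding the asset only on days with an invest signal."""
    growth = 1.0
    for a, s in zip(actual, signals):
        if s:
            growth *= 1.0 + a
    return growth - 1.0


# Hypothetical returns: three small gains and one large loss.
actual = [0.002, 0.003, 0.001, -0.05]
signals = [True, True, True, True]

print(hit_rate(actual, signals))      # three of four calls are correct
print(total_return(actual, signals))  # negative despite the high hit rate
```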

Insights from an empirical exercise

“We prove the inefficiency of the error-based and accuracy-based metrics …We apply several AI regression algorithms: (i) multi-layer perceptron (MLP), (ii) Long Short-Term Memory neural networks (LSTM), (iii) residual neural networks (ResNet), (iv) Support Vector Machine (SVM) and (v) a decision tree-based algorithm “eXtreme Gradient Boosting” (XGB) to 28 stocks of the Dow Jones. We use different hyper-parameters with each algorithm to generate 980 series of daily returns. We use 20 years history of daily prices: 15 years are used to train our algorithms and 5 years (1260 days) for testing as out-of-sample data.”

“We compute the MSE, RMSE, MAE (mean absolute error) and MAPE (mean absolute percentage error) of the regressions. We benchmark each of the 980 series with the ‘back-trading’ of a perfectly informed agent that invests when the return is positive and doesn’t invest when the return is negative or zero. We compute R, R², accuracy, F1, precision & recall and Matthew’s correlation coefficient.”

“We apply the following investment strategy: if the predicted return of the next day is positive, we invest for one day, otherwise we take no open position. In each case, the model integrates direct transactions costs of 0.10% per transaction applied to the value of the transaction. From that investment strategy and assuming a risk-free rate at 0.0%, we compute the annual return (RoI), the volatility (Vol), the yearly maximum drawdown (MDD) in percentage of the investment and the Sharpe, Sortino and Calmar ratios.

With the error-based metrics, we expect a negative correlation with the return, Sharpe, Sortino and Calmar ratios: the lower the error, the better the expected result. In italic, the metrics that are positively correlated. Against expectations for efficient metrics, correlations disclosed in [the table below] are positive, except between MAPE and the risk/return performance metrics, but not significantly different from 0 at 5% significance, as illustrated with the p-values. MAPE is thus the only metric with a negative correlation, and even that is not significantly different from 0.”

“Efficient accuracy-based metrics should have positive and significant correlations with the return, Sharpe, Sortino and Calmar ratios and the accuracy-based metrics, that is higher accuracy. In italic, the negative correlations…Accuracy, F1, precision, recall…are positively correlated with the return and with Sharpe, Sortino and Calmar ratios. Accuracy and F1 have the highest correlation.”
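
The investment strategy and the investment-based metrics used in this exercise can be sketched roughly as follows. This is not the paper's code. It assumes 252 trading days per year, a zero risk-free rate, a cost of 0.10% of capital charged whenever the position changes, a drawdown measured over the whole sample rather than per year, and common textbook definitions of the Sharpe, Sortino and Calmar ratios. All function names are hypothetical.

```python
import math
import statistics

TRADING_DAYS = 252  # assumption: trading days per year


def strategy_returns(actual, predicted, cost=0.001):
    """Daily net returns of a long/flat rule: hold the asset on days with a
    positive predicted return, stay in cash otherwise."""
    net, prev_pos = [], 0
    for r, p in zip(actual, predicted):
        pos = 1 if p > 0 else 0
        # Assumption: the cost is charged each time the position changes.
        fee = cost * abs(pos - prev_pos)
        net.append(pos * r - fee)
        prev_pos = pos
    return net


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


def performance(returns):
    """Annualised return, volatility, drawdown and three classic ratios."""
    n = len(returns)
    annual_return = math.prod(1.0 + r for r in returns) ** (TRADING_DAYS / n) - 1.0
    mean_ann = statistics.fmean(returns) * TRADING_DAYS
    vol = statistics.stdev(returns) * math.sqrt(TRADING_DAYS)
    # Downside deviation: only negative returns contribute.
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / n * TRADING_DAYS)
    mdd = max_drawdown(returns)

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    return {"annual_return": annual_return, "volatility": vol,
            "max_drawdown": mdd, "sharpe": ratio(mean_ann, vol),
            "sortino": ratio(mean_ann, downside),
            "calmar": ratio(annual_return, mdd)}


# Hypothetical realised and predicted daily returns.
actual = [0.01, -0.02, 0.015, 0.005]
predicted = [0.002, -0.001, 0.003, 0.001]
net = strategy_returns(actual, predicted)
stats = performance(net)
print(net)
print(stats)
```

With only four days of invented data the annualised figures are meaningless; the sample only shows the mechanics, including the cost incurred on each entry and exit.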

The issues with Sharpe and Sortino ratios

“Sharpe and Sortino ratios suffer from two important issues…

  • The Sharpe ratio quantifies risk using the standard deviation of excess returns, and Sortino by using the standard deviation of the negative excess returns. They assume that returns are normally distributed, with no skewness and a kurtosis around 3. If a portfolio’s return does not follow a Gaussian distribution, then the classical return volatility is no longer an effective measure of risk, and these ratios could underestimate the risk…
  • [Sharpe and Sortino ratios] do not allow the performance of different algorithms to be compared over different assets or over different time periods. The results of Sharpe and Sortino are influenced by the return of the underlying asset.”

Proposal for an algorithm performance metric

“The objective of…[trading] algorithms…is to optimize the expected return of investments under the constraint of the risks generated by the investment. Our analysis will therefore focus on the ability of metrics to provide a good proxy for the ability of an algorithm to achieve the objective of improving the risk-adjusted return.”

“We propose a new performance metric that improves the risk measurement and which has the ability to compare the efficiency of algorithms over time and across assets.

  • The use of Cornish Fisher Value-at-risk significantly improves the measure of risk, compared to the volatility of returns…Value-at-risk (VaR)…offers a way to address skewness and kurtosis of the asset returns distribution with Cornish Fisher expansion (CF expansion). CF expansion accounts for the 4 moments of the distribution: the return, the volatility, the skewness and the kurtosis. It offers an easily implementable parametric form that improves risk measurement. Cornish-Fisher VaR (CF-VaR) is an effective and easy-to-implement approach to dealing with non-Gaussian distributions…There is no easy and parametric way to improve the risk measurement [beyond] CF-VaR.
  • If we combine the asset return with the CF-VaR, we can easily define a return-to-VaR ratio equal to return divided by CF-VaR. This ratio outperforms Sharpe, Sortino or Calmar ratios as it better captures the effective risk accepted to generate the effective return.
  • We propose to define a new risk-adjusted return ratio as ‘Discriminant ratio’ or ‘D-ratio’ (D), which solely focuses on the added value of the algorithm. To achieve this objective, we divide the Return-to-VaR ratio of the algorithm by the Return-to-VaR ratio of the Buy & Hold. If the D-ratio is greater than 1, the algorithm overperforms the Buy & Hold strategy; if the D-ratio is smaller than 1, it underperforms the buy & hold strategy.
  • To adequately address the situation where the return of the buy & hold and the return of the algorithm are of opposite sign…We propose to correct our D-ratio for the difference in sign by [subtracting the risk-adjusted return ratio of buy and hold strategy from the algo strategy and dividing the difference by the absolute value of the risk-adjusted return ratio of buy and hold strategy].”
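
A minimal sketch of a Cornish-Fisher VaR and the return-to-VaR ratio mentioned above. It uses the standard fourth-order Cornish-Fisher quantile adjustment with population moments and, as an assumption, a 99% confidence level on daily returns; the paper's exact specification (confidence level, horizon, annualisation) may differ. A plain Gaussian VaR is included only for comparison, and all function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def _moments(returns):
    """Mean, standard deviation, skewness and excess kurtosis (population form)."""
    n = len(returns)
    mu = fmean(returns)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in returns) / n)
    skew = sum((r - mu) ** 3 for r in returns) / (n * sigma ** 3)
    excess_kurt = sum((r - mu) ** 4 for r in returns) / (n * sigma ** 4) - 3.0
    return mu, sigma, skew, excess_kurt


def gaussian_var(returns, confidence=0.99):
    """VaR under normality, reported as a positive loss."""
    mu, sigma, _, _ = _moments(returns)
    z = NormalDist().inv_cdf(1.0 - confidence)
    return -(mu + z * sigma)


def cornish_fisher_var(returns, confidence=0.99):
    """VaR with the normal quantile adjusted for skewness and excess kurtosis."""
    mu, sigma, skew, excess_kurt = _moments(returns)
    z = NormalDist().inv_cdf(1.0 - confidence)
    z_cf = (z
            + (z ** 2 - 1) * skew / 6
            + (z ** 3 - 3 * z) * excess_kurt / 24
            - (2 * z ** 3 - 5 * z) * skew ** 2 / 36)
    return -(mu + z_cf * sigma)


def return_to_var(annual_return, cf_var):
    """Return-to-VaR ratio: return divided by CF-VaR."""
    return annual_return / cf_var


# Hypothetical sample with a fat left tail: nine small gains, one large loss.
fat_left = [0.01] * 9 + [-0.09]
print(gaussian_var(fat_left))
print(cornish_fisher_var(fat_left))  # larger than the Gaussian figure here
```

For this invented left-skewed sample, the Cornish-Fisher figure exceeds the Gaussian one, which is the sense in which volatility-based measures can understate risk.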

The overall formula is:

D-ratio = 1 + (R[algo] – R[B&H]) / Abs(R[B&H])

where

R[algo] is the risk-adjusted return ratio of the algorithm
R[B&H] is the risk-adjusted return ratio of buy and hold
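
The formula translates directly into a few lines of code. The sketch below is illustrative: the function name and the input values are hypothetical, and the inputs are assumed to be return-to-VaR ratios that have already been computed.

```python
def d_ratio(ratio_algo, ratio_bh):
    """D-ratio = 1 + (R[algo] - R[B&H]) / Abs(R[B&H])."""
    if ratio_bh == 0:
        raise ValueError("buy-and-hold ratio must be non-zero")
    return 1.0 + (ratio_algo - ratio_bh) / abs(ratio_bh)


# Same sign: reduces to the plain quotient R[algo] / R[B&H].
print(d_ratio(1.5, 1.0))
# Buy and hold loses while the algorithm gains: the ratio is well above 1.
print(d_ratio(0.5, -1.0))
# Buy and hold gains while the algorithm loses: the ratio is below 1.
print(d_ratio(-0.5, 1.0))
```

When both ratios are positive, the expression equals the simple quotient of the two ratios. The absolute value in the denominator matters only when buy and hold has a negative ratio, where a plain quotient would flip the sign of the comparison.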

The D-ratio can be decomposed to assess whether the added value of the algorithm is more linked to the improved expected return or to the risk reduction ability.

D-ratio = D-return ratio * D-VaR ratio

where

D-return ratio = D-ratio / D-VaR ratio
D-VaR ratio = CF-VaR[B&H] / CF-VaR[algo]

[The] D-return ratio evaluates the ability of the algorithm to increase the expected return. If D-return is above 1, the algorithm outperforms the buy & hold strategy for its expected return. Otherwise, the Buy & Hold strategy is return-wise more efficient than the algorithm.

If D-VaR is above 1, the algorithm outperforms the buy & hold strategy for its risk management, as the CF-VaR of the Buy & Hold is greater than the CF-VaR of the algorithm.
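
The decomposition can be sketched as follows, under the assumption that the inputs are annual returns and positive CF-VaR figures for the algorithm and for buy and hold. The function name and the numbers are hypothetical.

```python
def d_ratio_decomposition(ret_algo, var_algo, ret_bh, var_bh):
    """Split the D-ratio into a return component and a risk component."""
    if var_algo <= 0 or var_bh <= 0:
        raise ValueError("CF-VaR inputs are assumed to be positive")
    if ret_bh == 0:
        raise ValueError("buy-and-hold return must be non-zero")
    r_algo = ret_algo / var_algo  # return-to-VaR ratio of the algorithm
    r_bh = ret_bh / var_bh        # return-to-VaR ratio of buy and hold
    d = 1.0 + (r_algo - r_bh) / abs(r_bh)
    d_var = var_bh / var_algo     # above 1 if the algorithm carries less risk
    d_return = d / d_var          # residual attributed to return enhancement
    return {"d_ratio": d, "d_return": d_return, "d_var": d_var}


# Hypothetical inputs: 12% return with 3% CF-VaR versus 8% return with 4% CF-VaR.
parts = d_ratio_decomposition(0.12, 0.03, 0.08, 0.04)
print(parts)
```

In this invented case both components are above 1, so the algorithm adds value through higher return and lower risk alike. When both returns are positive, the D-return ratio equals the simple quotient of the two returns.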

Ralph Sueppel is founder and director of SRSV, a project dedicated to socially responsible macro trading strategies. He has worked in economics and finance for over 25 years for investment banks, the European Central Bank and leading hedge funds. At present, he is head of research and quantitative strategies at Macrosynergy Partners.