The business of algorithmic trading strategies creates incentives for model overfitting and backtest embellishment: researchers must pass Sharpe ratio thresholds for their strategies to be considered, while managers lack interest in realistic simulations of ideas. Overfitting leads to bad investment decisions and underestimated risk. Sound ethical principles are the best method for containing this risk (view post here). Where these principles have been compromised one should assume that all ‘tweaks’ to the original strategy idea that are motivated by backtest improvement contribute to overfitting. Their effects can be estimated by looking at the ratio of directional ‘flips’ in the trading signal. Overfitting typically increases with the Sharpe ratio targets of the business and the scope for applying ‘tweaks’ to the strategy. Realistic strategy performance expectations should be based on a range of plausible strategy versions, not on an optimized one.

Rej, Adam, Philip Seager and Jean-Philippe Bouchaud (2019), “How should you discount your backtest PnL?”

This post ties in with SRSV’s summary lecture on macro information efficiency, particularly the section on data science.
Below are excerpts from the paper. Emphasis and italicized text have been added. Formulas and mathematical symbols have been paraphrased.

The overfitting problem

“Any strategy, whether of discretionary or systematic nature that is appraised using a backtest is at risk of being overfitted…Part of its performance is due to favorable alignment of market forces [i.e. market returns have been aligned with strategy signals for reasons other than a stable structural or causal relation]. This windfall performance is of course not to be counted upon in the future. In the language of stochastic processes, favorable or unfavorable market conditions simply represent pure noise. Taking favorable noise realizations at their face values is thus an important source of overfitting.”

Why overfitting is the norm

“Investment research teams sift through historical market data, in the hope of discovering recurring patterns that could be monetized…A new strategy should increase the diversification and the expected return of the portfolio. For a strategy uncorrelated with existing portfolio strategies, these requirements will translate into setting a Sharpe ratio threshold that the strategy at hand needs to clear… This is one of the inherent sources of overfitting, as [the researcher] will only pitch strategies such that the backtested Sharpe ratio is above the threshold.”

“If the research team has reasons to believe that their strategy is sound, but the backtest P&L does not meet their expectations, they will most likely not discard it… The research team will propose ‘improvements’ to these building blocks and one or more series of improvements will result in an acceptable backtest performance. Did this procedure truly improve the strategy? In most cases the answer is no. The performance enhancement simply comes from ‘improving’ the noise realization of the original strategy. Even if the improvement is genuine, there is no way of knowing.”

“If the performance is below the required one, the researcher will try to improve the strategy. These ‘tweaks’ typically consist of slight modifications of the strategy, such as replacing the filter with a similar one, changing some parameters, removing certain asset types, etc. The researcher usually has a reasonably sounding narrative to justify these. Of course, only modifications leading to improvement of the in-sample performance are retained.”

A framework for adjusting for overfitting

“The true Sharpe ratio of [a] strategy remains unbeknownst to the researcher. The best she can do is to calculate the estimated Sharpe ratio. It is well known that the estimation of Sharpe ratios is subject to considerable errors because it is impossible to separate the drift term [genuine value generation] from the realization of noise…Sharpe ratios computed using a finite number of data points are approximately normally distributed [around their true value].”
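The size of this estimation error can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes independent, normally distributed daily returns, 252 trading days per year, and illustrative values for the true Sharpe ratio and backtest length. Under those assumptions the spread of the estimated annualized Sharpe ratio should come out close to one over the square root of the backtest length in years.

```python
import math
import random

DAYS_PER_YEAR = 252  # assumed trading-day convention


def estimated_sharpe(true_sharpe, years, rng):
    """Annualized Sharpe ratio estimated from one simulated daily P&L path."""
    n = years * DAYS_PER_YEAR
    daily_drift = true_sharpe / math.sqrt(DAYS_PER_YEAR)  # daily volatility set to 1
    pnl = [rng.gauss(daily_drift, 1.0) for _ in range(n)]
    mean = math.fsum(pnl) / n
    var = math.fsum((x - mean) ** 2 for x in pnl) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(DAYS_PER_YEAR)


def sharpe_sampling_error(true_sharpe=0.5, years=10, trials=400, seed=0):
    """Mean and standard deviation of the estimate across simulated backtests."""
    rng = random.Random(seed)
    estimates = [estimated_sharpe(true_sharpe, years, rng) for _ in range(trials)]
    mean = math.fsum(estimates) / trials
    var = math.fsum((e - mean) ** 2 for e in estimates) / (trials - 1)
    return mean, math.sqrt(var)


mean_est, std_est = sharpe_sampling_error()
print(f"mean of estimates: {mean_est:.2f}")
print(f"std of estimates:  {std_est:.2f}")
print(f"1/sqrt(years):     {1 / math.sqrt(10):.2f}")
```

With a ten-year backtest the estimation error is of the same order as the Sharpe ratios typically under consideration, which is why a single favorable noise realization can easily pass for genuine value generation.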

“We will assume that every modification [that tweaks a strategy subsequent to an initial backtest] deteriorates the out-of-sample performance [relative to that initial backtest]….[In particular,] we shall assume that every modification (tweak) to the strategy translates into flipping predictor signs on a subset of the sample. The parameter [that measures the share of flipped signs] essentially captures the researcher’s overfitting prowess…We shall assume that if the original realization of the strategy does not clear the threshold, the research team will continue improving it and will stop as soon as the Sharpe ratio is above the threshold. While tampering with the parameters of the strategy is probably well modelled by flipping signs…We illustrate this process in [the figure below].”

“The parameter [that measures the share of flipped signs] thus captures how much the researcher is ready to depart from the original P&L. We expect that typically, a researcher would not want the ‘improved’ strategy to be less than 80% correlated with the original proposal, which would translate into the upper bound of the [sign-flipping] parameter of 10%.”
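The link between the two numbers in this quote can be checked directly. The sketch below assumes zero-mean daily P&L with constant volatility and flips the sign on a randomly chosen share of days (both are illustrative assumptions). In that setting the correlation between the tweaked and the original P&L is approximately one minus twice the flipped share, so a 10% flip share corresponds to a correlation of about 80%.

```python
import math
import random


def correlation_after_flips(flip_share, n_days=5000, seed=0):
    """Pearson correlation between a P&L series and its sign-flipped version."""
    rng = random.Random(seed)
    original = [rng.gauss(0.0, 1.0) for _ in range(n_days)]
    # Days on which the tweak reverses the position (chosen at random here).
    flipped_days = set(rng.sample(range(n_days), round(flip_share * n_days)))
    tweaked = [-x if i in flipped_days else x for i, x in enumerate(original)]

    mean_o = math.fsum(original) / n_days
    mean_t = math.fsum(tweaked) / n_days
    cov = math.fsum((o - mean_o) * (t - mean_t) for o, t in zip(original, tweaked))
    var_o = math.fsum((o - mean_o) ** 2 for o in original)
    var_t = math.fsum((t - mean_t) ** 2 for t in tweaked)
    return cov / math.sqrt(var_o * var_t)


for share in (0.05, 0.10, 0.20):
    corr = correlation_after_flips(share)
    print(f"flip share {share:.0%}: correlation {corr:.2f}, 1 - 2f = {1 - 2 * share:.2f}")
```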

“The overfitting factor…is the ratio of the expected in-sample Sharpe ratio and the expected out-of-sample Sharpe…It measures how much overfitting should be expected on average if the researcher’s behavior we assumed is representative…We find that for typical Sharpe ratios of CTA strategies (0.3-0.5) and for reasonable values of other parameters (5% of signs flipped through tweaking, target thresholds of Sharpes of 0.7) the discounting [or overfitting] factor is 2, which is in line with our experience and seems to be the industry standard.”
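A toy Monte Carlo can make the overfitting factor concrete. The sketch below is a simplified illustration and not the paper's exact specification. Its assumptions: daily returns are independent and normal; the researcher tweaks by flipping the position on the worst in-sample days, one at a time, and stops as soon as the threshold is cleared or the allowed flip share is exhausted; a strategy that never clears the threshold is not pitched; and each flipped share f reduces the expected out-of-sample Sharpe ratio to (1 - 2f) times the true one. All function names and parameter values are hypothetical.

```python
import math
import random

DAYS_PER_YEAR = 252  # assumed trading-day convention


def annualized_sharpe(total, total_sq, n):
    """Annualized Sharpe ratio from the sum and sum of squares of daily P&L."""
    mean = total / n
    var = (total_sq - n * mean * mean) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(DAYS_PER_YEAR)


def pitch_strategy(true_sharpe, years, threshold, max_flip_share, rng):
    """Return (in-sample, expected out-of-sample) Sharpe, or None if never pitched."""
    n = years * DAYS_PER_YEAR
    drift = true_sharpe / math.sqrt(DAYS_PER_YEAR)
    pnl = sorted(rng.gauss(drift, 1.0) for _ in range(n))  # worst days first
    total = math.fsum(pnl)
    total_sq = math.fsum(x * x for x in pnl)  # unchanged by sign flips
    sharpe = annualized_sharpe(total, total_sq, n)

    flips = 0
    max_flips = int(max_flip_share * n)
    while sharpe < threshold and flips < max_flips:
        total -= 2.0 * pnl[flips]  # reversing the position turns this loss into a gain
        flips += 1
        sharpe = annualized_sharpe(total, total_sq, n)

    if sharpe < threshold:
        return None  # strategy is discarded rather than pitched
    # Assumed rule: flipped signs are wrong out of sample and cost 2f of the true Sharpe.
    return sharpe, (1.0 - 2.0 * flips / n) * true_sharpe


def overfitting_factor(true_sharpe=0.4, years=10, threshold=0.7,
                       max_flip_share=0.05, trials=300, seed=1):
    """Ratio of mean in-sample to mean out-of-sample Sharpe across pitched strategies."""
    rng = random.Random(seed)
    results = (pitch_strategy(true_sharpe, years, threshold, max_flip_share, rng)
               for _ in range(trials))
    pitched = [r for r in results if r is not None]
    mean_in = math.fsum(r[0] for r in pitched) / len(pitched)
    mean_out = math.fsum(r[1] for r in pitched) / len(pitched)
    return mean_in / mean_out


factor = overfitting_factor()
print(f"simulated overfitting factor: {factor:.2f}")
```

Because every pitched strategy shows an in-sample Sharpe ratio at or above the threshold while the out-of-sample expectation cannot exceed the true Sharpe ratio, the factor in this toy setup is bounded below by the ratio of threshold to true Sharpe. The exact value depends heavily on the assumed tweak mechanism.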

The upshot of this theory is that any tweaks to an original strategy idea that are motivated by an improvement of backtest performance contribute to overfitting bias. If the research process is transparent, an indication of the bias is the ‘sign flipping ratio’ resulting from the tweaks. If the research process is opaque, we must form a judgment on how much tweaking has plausibly been applied, possibly by asking for statistics on other ‘untweaked’ versions of the strategy.
This does not mean that past performance cannot be a guide for strategy design, but only that improvements that arise from this ‘tweaking’ should not affect our judgment on out-of-sample performance.

Why long backtests do not prevent overfitting

“The quality of the estimate may be increased only by increasing the length of the backtest. In practice, however, for many asset types backtests are limited to (at most) a couple of decades of daily data…For a Sharpe ratio of 0.5 one needs 43 years of backtest data in order to be 99.9% confident that the performance is significantly different from noise…In practice, an asset manager would appraise the residual performance with respect to existing strategies. Residual Sharpe ratios on the order of 0.3-0.5 are commonplace in the CTA [commodity trading advisors, i.e. systematic investment funds] space.”
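The 43-year figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes that the estimated annualized Sharpe ratio is normally distributed with a standard error of one over the square root of the backtest length in years, and that significance is judged with a two-sided test. These are assumptions of the sketch that happen to match the quoted number, not a statement of the paper's exact method.

```python
from statistics import NormalDist


def years_needed(sharpe, confidence=0.999):
    """Backtest length (years) at which `sharpe` is significant at `confidence`."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Significance requires sharpe * sqrt(years) >= z.
    return (z / sharpe) ** 2


for s in (0.3, 0.5, 1.0):
    print(f"Sharpe {s:.1f}: about {years_needed(s):.0f} years")
```

The required length scales with the inverse square of the Sharpe ratio, so at the lower end of the residual Sharpe range quoted above the required history is far longer than any available dataset.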

“For small values of the [sign-flipping] parameter at least, the overfitting factor diminishes in value for longer backtests [but] converges [towards a finite positive value that is proportionate to the targeted threshold Sharpe ratio] for large values of the sample size. This is because the probability that the original realization will cross the threshold drops with increasing sample size and consequently the probability of the researcher’s intervention increases accordingly…[Moreover,] since effectively a fraction of the backtest is used for overfitting, increasing the length of the backtest also increases the overfitting freedom and the conditional expectation does not vary much.”
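The first part of this argument can be illustrated numerically. The sketch below assumes a true Sharpe ratio below the threshold and an estimated Sharpe ratio that is normally distributed around the true value with a standard error of one over the square root of the backtest length in years. The parameter values are illustrative, taken from the ranges quoted earlier.

```python
import math
from statistics import NormalDist


def prob_clears_threshold(true_sharpe, threshold, years):
    """Probability that the untweaked backtest Sharpe ratio exceeds the threshold."""
    stderr = 1.0 / math.sqrt(years)  # assumed standard error of the annualized estimate
    return 1.0 - NormalDist(mu=true_sharpe, sigma=stderr).cdf(threshold)


for years in (5, 10, 20, 40):
    p = prob_clears_threshold(true_sharpe=0.4, threshold=0.7, years=years)
    print(f"{years:>2} years: P(no tweaking needed) = {p:.1%}")
```

As the backtest lengthens, a strategy whose true Sharpe ratio lies below the threshold is ever less likely to clear it by luck, so nearly every pitched version has been tweaked.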

“The more overfitting freedom, the higher the level of overfitting. Lower Sharpe ratio strategies are more strongly impacted than the higher Sharpe strategies.”