Can backtesting results generalize in forward tests or out of sample tests?
Why do trading systems tend to perform worse on new data? In researching the topic I compiled a list of 10 possible reasons why out of sample testing may be much worse than in sample testing, along with some ideas on how to overcome these potential obstacles. The two pictures below show equity curves with the in sample and out of sample periods divided by a red vertical line. Note how the equity curve on the left performs much worse out of sample than the equity curve on the right. What is the difference?
Definitions: In sample refers to the data that is set aside for thorough testing. In sample testing includes optimization of parameters and should cover all tests performed in preparation for running the system live or running an out of sample test for validation purposes.
Out of sample data is set aside separately from the in sample data and is used to confirm that the in sample tests produced valid results. It should not be used for optimization or for any testing other than a single validation at the end of the process. Because out of sample data has never been used to test the system, it replicates what you would get if you ran the system forward over that untested period, as in a forward test.
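As a concrete illustration, here is a minimal Python sketch of such a split; the 80/20 ratio is an assumption for illustration, not a rule:

```python
import pandas as pd

# Minimal sketch: carve off the out of sample segment before any
# testing begins. The 80/20 split is an assumed ratio, not a rule.
def split_sample(prices: pd.Series, in_sample_frac: float = 0.8):
    """Return (in_sample, out_of_sample) slices of a time-indexed series."""
    cutoff = int(len(prices) * in_sample_frac)
    return prices.iloc[:cutoff], prices.iloc[cutoff:]

# in_sample, out_of_sample = split_sample(closes)
# Optimize only on in_sample; touch out_of_sample exactly once, at the end.
```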
1. Too much optimization of in sample data. Over-optimizing the in sample period increases the chance that a superior test result will be found by luck alone. The more parameters that are tested, and the finer the granularity at which they are tested, the greater the chance that the in sample results will be over fit to the in sample data alone and will not generalize to any other data.
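To see how quickly the search space grows, consider this small Python illustration; the parameter names and ranges are hypothetical:

```python
from itertools import product

# Hypothetical parameter grid: finer granularity multiplies the number
# of candidate systems, and with it the odds of a lucky standout.
fast_ma   = range(5, 51)         # 46 values at step 1 (vs. 10 at step 5)
slow_ma   = range(50, 201, 5)    # 31 values
stop_pips = range(10, 101, 10)   # 10 values

n_tests = len(list(product(fast_ma, slow_ma, stop_pips)))
print(f"{n_tests} candidate systems tested on one fixed data set")
# 14,260 tests: even a worthless strategy family will produce a few
# impressive in sample equity curves by chance alone.
```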
2. The data has a regime change right at the end of the in sample test period that causes the out of sample data to behave differently. An instance of a regime change might be a marked change in volatility occurring just at the end of the in sample test period that causes the system to lose money out of sample.
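One rough way to check for this, sketched below, is to compare realized volatility just before and just after the split; the 60-bar window and the returns series are assumptions for illustration:

```python
import pandas as pd

# Rough regime check: ratio of out of sample volatility to volatility
# at the end of the in sample period. Window size is an assumption.
def volatility_shift(returns: pd.Series, split_date: str, window: int = 60) -> float:
    rolling_vol = returns.rolling(window).std()
    in_vol = rolling_vol.loc[:split_date].iloc[-1]   # vol at end of in sample
    out_vol = returns.loc[split_date:].std()         # vol across out of sample
    return out_vol / in_vol

# A ratio far from 1.0 suggests the out of sample period is a different
# volatility regime than the one the system was fit to.
```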
3. Filters that improve in sample system performance may paradoxically decrease performance out of sample. One way to test whether this is the case, sketched in the code below, is to systematically remove the filters one at a time to determine if they are too specific to the current period and not generalizable enough to other times. To test this properly you will need an additional validation period that can be used to evaluate each system change while preserving the out of sample data for the final validation.
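Here is that removal loop as a sketch, assuming a run_backtest(data, filters) hook into your own framework (hypothetical here):

```python
# Filter ablation sketch: drop each filter in turn and compare the
# validation-period result to the all-filters baseline. run_backtest
# is a hypothetical hook returning net profit for a given filter set.
def ablate_filters(run_backtest, data, filters):
    baseline = run_backtest(data, filters)
    return {
        f: run_backtest(data, [x for x in filters if x != f]) - baseline
        for f in filters
    }

# A filter whose removal barely hurts (or even helps) on the separate
# validation period is a candidate for being fit to the in sample data.
```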
4. One of the mathematical calculations is implemented incorrectly. If you use mathematical functions such as standard deviation, correlation, etc., take care to ensure that the calculations are performed correctly by checking them against a known good source for your specific math functions.
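For example, a hand-rolled standard deviation can be checked against NumPy in a few lines; the sample values here are arbitrary:

```python
import numpy as np

# Check a hand-rolled calculation against a reference implementation.
def my_stdev(values):
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

data = [1.2, 0.9, 1.4, 1.1, 0.8]
assert abs(my_stdev(data) - np.std(data)) < 1e-12, "stdev disagrees with NumPy"
# Note: np.std defaults to the population form (ddof=0); silently mixing
# it with the sample form (ddof=1) is exactly the kind of subtle error
# this check catches.
```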
5. Due to a large number of tests, good results were obtained by luck. This is similar to #1. If you torture the data long enough it will eventually relent and provide you with a superior test result. Unfortunately, most of the time this rosy in sample result cannot be replicated out of sample. Determining how much testing is appropriate is more art than science; you will have to experiment to find how long to run your tests and how many iterations to allow for best results.
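A quick simulation makes the point; the counts below are arbitrary:

```python
import random

# 1,000 coin-flip "strategies" over 250 trading days: the best one
# always looks like a winner in sample, despite having no edge at all.
random.seed(42)
best = max(
    sum(random.choice([-1, 1]) for _ in range(250))   # random daily P&L
    for _ in range(1000)
)
print(f"best of 1,000 random systems: {best:+d} units over 250 days")
# Its expected out of sample performance is exactly zero.
```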
6. The data being used in the test is too inconsistent to base a systematic strategy upon. This can happen if you are using low quality data as an input to your system, or if you are asking more from your data than it is able to produce.
7. Your testing method is introducing look ahead or data snooping bias. Data snooping bias occurs when you use information about the future in your system development or testing methodology. It is very easy to introduce this bias if you perform manual backtests, which is one reason I feel it is better to do automated backtesting. Take some time to ensure that no data from your out of sample period leaks into the in sample period.
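The sketch below shows one of the most common automated versions of this bug and its fix; the prices and the 3-bar moving average rule are arbitrary examples:

```python
import pandas as pd

# A signal computed from today's close must not be traded on today's
# close. Shifting the signal one bar ensures only past data drives
# each trade.
closes = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0])
signal = (closes > closes.rolling(3).mean()).astype(int)

biased = closes.pct_change() * signal           # peeks at the same bar it trades
honest = closes.pct_change() * signal.shift(1)  # trades on the next bar

# Only the shifted version replicates what live trading could have done.
```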
8. Your testing methodology is introducing selection bias. Selection bias may be introduced in the in sample testing phase when you select parameters based on best performance. Generally speaking, parameters should be selected based on subsets of the in sample period rather than on the entire in sample period. This is a very difficult bias to overcome, particularly with traditional backtesting software. I may post a future article on this topic.
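One way to act on the subset idea, sketched below, is to score each parameter set on several in sample subsets and keep the one with the best worst-case result; evaluate(param, subset) is a hypothetical hook into your backtester:

```python
# Robust parameter selection sketch: rank by worst-case performance
# across in sample subsets, not by the single best full-period score.
# evaluate(param, subset) is a hypothetical scoring hook.
def robust_select(evaluate, params, subsets):
    scores = {p: [evaluate(p, s) for s in subsets] for p in params}
    return max(scores, key=lambda p: min(scores[p]))

# A parameter set that holds up on every subset is less likely to be
# a fluke than one that wins big on the full period only.
```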
9. The optimization in the in sample period favors highly profitable outcomes. As a result, your loss taking strategy may never be fully tested. An example of this is using a very small profit target with a larger stop loss. Based on the optimization results, you may think the stop criterion rarely triggers, but when new data with different volatility characteristics arrives in the out of sample period, your system ends up with a string of stop losses. Because the stop loss is large, it is rarely triggered in sample, and your optimization process will naturally select systems with a low stop out frequency. But if conditions line up on the unseen data such that several stop losses occur in sequence, you may find that your loss exit parameters are not as optimal as your testing indicated. To overcome this problem, reduce the range of values allowed for adverse loss exits so you get a better feel for how the system handles taking regular losses during the test period.
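A sketch of that constraint, with hypothetical pip values and a run_backtest placeholder for your own framework:

```python
from itertools import product

# Constrain the stop loss search range so the optimizer cannot hide
# behind stops so wide they never fire in sample. Values are
# hypothetical; run_backtest is a placeholder scoring function.
profit_targets = range(20, 101, 10)   # pips
stop_losses    = range(20, 61, 10)    # capped at 60 pips, not 300

def optimize(run_backtest, data):
    return max(
        product(profit_targets, stop_losses),
        key=lambda pt_sl: run_backtest(data, target=pt_sl[0], stop=pt_sl[1]),
    )

# With stops forced into a realistic range, the in sample test must
# take regular losses, so the loss exit logic actually gets exercised.
```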
10. The data has holes in it or is missing periods, which may cause unpredictable signals to be triggered during the test. Always spot check the data before testing, looking for periods where data is missing. During my last round of testing I found a week of EURUSD data missing that was causing unreliable results.
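A simple gap scan like the Python sketch below would have caught that missing week; the one-hour expected spacing assumes hourly bars, so adjust it to your data, and remember that normal weekend closures in FX data will show up as well:

```python
import pandas as pd

# Scan a time-indexed data set for holes before testing. Expected
# spacing of one hour assumes hourly bars; weekend closures in FX
# data will also appear, so compare against a trading calendar.
def find_gaps(index: pd.DatetimeIndex, expected=pd.Timedelta("1h")):
    deltas = index.to_series().diff()
    return deltas[deltas > expected]  # where data resumes after a hole

# gaps = find_gaps(eurusd.index)
# A missing week shows up as a single delta of roughly seven days.
```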