Ten Reasons Out of Sample Performance May Be Worse Than In Sample Trading System Performance

Can backtesting results generalize in forward tests or out of sample tests?

Why do trading systems tend to perform worse on new data? In researching the topic I created a list of 10 possible reasons why out of sample testing may be much worse than in sample testing, along with some ideas on how to overcome these potential obstacles. The two pictures below show equity curves with the in sample and out of sample periods divided by a red vertical line. Note how the equity curve on the left performs much worse out of sample than the equity curve on the right. What is the difference?

Example of In Sample and Out of Sample Test that is unprofitable.
Example of In Sample and Out of Sample Test that is profitable.

Definitions: In sample refers to the data that is set aside for thorough testing. In sample testing includes parameter optimization and covers all of the tests performed before running the system live or before running an out of sample test for validation.

Out of sample data is set aside separately from the in sample data and is used to confirm that the in sample tests produced valid results. It should not be used for optimization or for any testing other than a single validation at the end of the process. Because the out of sample data has never been used to test the system, it replicates what you would get if you ran the system forward over that untested period, as in a forward test.
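For readers who prefer to see the split in code, the sketch below divides a price series at a fixed date, with everything before the split reserved for in sample work and everything after held back for the final validation. The dates, the synthetic prices and the column name are hypothetical stand-ins for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices; substitute your own data load here.
dates = pd.date_range("2015-01-01", "2019-12-31", freq="B")
prices = pd.DataFrame(
    {"close": 100 * np.exp(np.cumsum(np.random.default_rng(0).normal(0, 0.01, len(dates))))},
    index=dates,
)

split_date = pd.Timestamp("2018-12-31")                 # everything after this is held out
in_sample = prices.loc[:split_date]                     # optimize and test here only
out_of_sample = prices.loc[prices.index > split_date]   # touch once, for the final validation

print("in sample:    ", in_sample.index[0].date(), "to", in_sample.index[-1].date())
print("out of sample:", out_of_sample.index[0].date(), "to", out_of_sample.index[-1].date())
```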

    1. Too much optimization of in sample data. Over-optimizing the in sample period increases the chance that a superior test result will be found by luck alone. The more parameters you test, and the finer their granularity, the greater the chance that the in sample results will be over fit to the in sample data alone and will not generalize to any other data.
    2. The data has a regime change right at the end of the in sample test period, which causes the out of sample data to behave differently. An example of a regime change is a marked shift in volatility occurring just at the end of the in sample period, causing the system to lose profitability.
    3. Filters that improve in sample performance may paradoxically decrease performance out of sample. One way to test whether this is happening is to systematically remove the filters one at a time and see whether they are too specific to the in sample period to generalize to other periods (a sketch of this check follows the list). To test this properly you will need an additional test period that can be used to validate each system change while preserving the out of sample data for the final validation.
    4. One of the mathematical calculations is implemented incorrectly. If you use a mathematical function such as standard deviation or correlation, take care to ensure the calculation is performed correctly by checking it against a known good implementation of that function (see the verification sketch after this list).
    5. Due to a large number of tests, good results were obtained by luck. This is similar to #1. If you torture the data long enough it will eventually relent and provide you with a superior test result. Unfortunately, most of the time this rosy in sample result cannot be replicated out of sample. Deciding how much testing is appropriate is more art than science; you will have to experiment to determine how long, and for how many iterations, to run your tests for best results.
    6. The data being used in the test is too inconsistent to base a systematic strategy upon. This can happen if you are using low quality data as an input to your system, or if you are asking more from your data than it is able to provide.
    7. Your testing method is introducing look-ahead or data snooping bias. Data snooping bias means using information about the future in your system development or testing methodology. It is very easy to introduce this bias if you perform manual backtests, which is one reason I feel it is better to do automated backtesting. Take some time to ensure that no data from your out of sample period is included in the in sample period (a simple overlap check is sketched after the list).
    8. Your testing methodology is introducing selection bias. Selection bias can creep in during the in sample testing phase when you select parameters based on best performance. Generally speaking, parameters should be selected based on subsets of the in sample period rather than on the entire in sample period (one approach is sketched after the list). This is a very difficult bias to overcome, particularly with traditional backtesting software. I may post a future article on this topic.
    9. The optimization in the in sample period favors highly profitable outcomes, so your loss taking strategy may never be fully tested. An example is using a very small profit target with a much larger stop loss. The optimization process naturally selects systems in which the large stop is rarely triggered, so in sample the stop criterion appears to hardly matter. But when new data with different volatility characteristics arrives in the out of sample period, conditions can line up so that several stop losses occur in sequence, and you discover that your loss exit parameters are not as optimal as the testing indicated. To overcome this problem, reduce the range of values allowed for adverse loss exits so the test gives you a better feel for how the system handles taking regular losses.
    10. The data has holes in it or is missing bars, which may cause unpredictable signals to be triggered during the test. Always spot check the data before testing, looking for periods where data is missing (a quick gap check is sketched below). During my last round of testing I found that a week of EURUSD data was missing, which was causing unreliable results.
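Below is a rough sketch of the one-filter-at-a-time check described in reason 3. The toy long-only system, its two filters and the synthetic prices are hypothetical stand-ins for your own strategy; the point is the toggling procedure, not the system itself.

```python
import numpy as np
import pandas as pd

def backtest(prices: pd.Series, trend_filter: bool = True, vol_filter: bool = True,
             lookback: int = 20) -> float:
    """Toy long-only system: earn the next bar's return unless a filter says stand aside."""
    next_ret = prices.pct_change().shift(-1)          # return from this bar to the next
    pos = pd.Series(1.0, index=prices.index)          # long by default
    if trend_filter:                                  # flat when price is below its moving average
        pos[prices < prices.rolling(lookback).mean()] = 0.0
    if vol_filter:                                    # flat when recent volatility is above its median
        vol = prices.pct_change().rolling(lookback).std()
        pos[vol > vol.median()] = 0.0
    return float((pos * next_ret).sum())

prices = pd.Series(100 * np.exp(np.cumsum(np.random.default_rng(1).normal(0, 0.01, 2000))))
print(f"all filters on:    {backtest(prices):+.3f}")
for name in ("trend_filter", "vol_filter"):
    print(f"{name} off: {backtest(prices, **{name: False}):+.3f}")
```

If switching a filter off barely changes, or even improves, the result, that filter may be fitting noise rather than adding robustness.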
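Reason 4 is cheap to guard against: compare any hand-rolled calculation with a trusted implementation on the same inputs before relying on it. The hand-rolled function below is a hypothetical example of the kind of helper that tends to hide an off-by-one or a wrong denominator.

```python
import numpy as np

def my_stdev(values, ddof: int = 0) -> float:
    """Hand-rolled standard deviation, as it might appear inside an indicator."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / (len(values) - ddof)
    return var ** 0.5

returns = np.random.default_rng(2).normal(0, 0.01, 500)
assert abs(my_stdev(returns) - np.std(returns)) < 1e-12                   # population std dev
assert abs(my_stdev(returns, ddof=1) - np.std(returns, ddof=1)) < 1e-12   # sample std dev
print("standard deviation implementation matches NumPy")
```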
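For reason 7, a blunt but effective guard is to assert, before any optimization runs, that the in sample and out of sample sets share no bars. The function below assumes two date-indexed DataFrames like the ones in the split sketch earlier; the function name is my own.

```python
import pandas as pd

def assert_no_overlap(in_sample: pd.DataFrame, out_of_sample: pd.DataFrame) -> None:
    """Fail loudly if any bar appears in both periods or the periods are out of order."""
    shared = in_sample.index.intersection(out_of_sample.index)
    if len(shared) > 0:
        raise ValueError(f"{len(shared)} bars appear in both periods; first one: {shared[0]}")
    if in_sample.index.max() >= out_of_sample.index.min():
        raise ValueError("in sample period extends past the start of the out of sample period")

# Call assert_no_overlap(in_sample, out_of_sample) once, right before testing begins.
```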
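One partial remedy for the selection bias in reason 8 is to score each candidate parameter on several sub-windows of the in sample period and keep the value that holds up across all of them, rather than the single best fit to the whole period. The moving-average lookback and the toy scoring function below are hypothetical stand-ins for whatever your optimizer actually varies.

```python
import numpy as np
import pandas as pd

def score(prices: pd.Series, lookback: int) -> float:
    """Toy score: mean next-bar return on bars where price is above its moving average."""
    above = prices > prices.rolling(lookback).mean()
    next_ret = prices.pct_change().shift(-1)
    return float(next_ret[above].mean())

def pick_robust_lookback(prices: pd.Series, candidates, n_windows: int = 4) -> int:
    """Keep the candidate with the best worst-case sub-window score, not the best overall fit."""
    windows = [pd.Series(chunk) for chunk in np.array_split(prices.to_numpy(), n_windows)]
    worst_case = {lb: min(score(w, lb) for w in windows) for lb in candidates}
    return max(worst_case, key=worst_case.get)

in_sample_prices = pd.Series(100 * np.exp(np.cumsum(np.random.default_rng(3).normal(0, 0.01, 2000))))
print("most robust lookback:", pick_robust_lookback(in_sample_prices, candidates=(5, 10, 20, 50)))
```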
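Finally, for reason 10, a quick spot check for holes before any testing: measure the spacing between consecutive bars and flag anything longer than a normal weekend. The file name and column names are hypothetical; point it at your own data.

```python
import pandas as pd

prices = pd.read_csv("EURUSD_daily.csv", parse_dates=["date"], index_col="date").sort_index()
gaps = prices.index.to_series().diff()
suspicious = gaps[gaps > pd.Timedelta(days=3)]       # longer than a normal weekend
if suspicious.empty:
    print("no gaps larger than a weekend found")
else:
    for bar_date, gap in suspicious.items():
        print(f"gap of {gap.days} days ending at {bar_date.date()}")
```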