Can backtesting results generalize in forward tests or out of sample tests?
Why do trading systems tend to perform worse on new data? In researching the topic I compiled a list of 10 possible reasons why out of sample testing may be much worse than in sample testing, along with some ideas on how to overcome these potential obstacles. The two pictures below show equity curves with the in sample and out of sample periods divided by a red vertical line. Note how the equity curve on the left performs much worse out of sample than the equity curve on the right. What is the difference?
Definitions: In sample refers to the data that is set aside for thorough testing. In sample testing includes optimization of parameters and should cover all tests performed in preparation for running the system live or running an out of sample test for validation purposes.
Out of sample data is set aside separately from the in sample data and is used to confirm that the in sample tests produced valid results. It should not be used for optimization or for any testing other than a single validation at the end of the process. Because out of sample data has never been used to test the system, it replicates what you would get if you ran the system forward over that untested period, as in a forward test.
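As a concrete illustration, here is a minimal Python sketch of such a split; the 80/20 ratio is an assumption for illustration, not a rule:

```python
import pandas as pd

# Minimal sketch: carve off the out of sample segment before any
# testing begins. The 80/20 split is an assumed ratio, not a rule.
def split_sample(prices: pd.Series, in_sample_frac: float = 0.8):
    """Return (in_sample, out_of_sample) slices of a time-indexed series."""
    cutoff = int(len(prices) * in_sample_frac)
    return prices.iloc[:cutoff], prices.iloc[cutoff:]

# in_sample, out_of_sample = split_sample(closes)
# Optimize only on in_sample; touch out_of_sample exactly once, at the end.
```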
1. Too much optimization of in sample data. Over-optimizing the in sample period increases the chance that a superior test result will be found by luck alone. The more parameters that are tested, and the finer the granularity at which they are tested, the greater the chance that the in sample results will be over fit to the in sample data alone and will not generalize to any other data.
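To see how quickly the search space grows, consider this small Python illustration; the parameter names and ranges are hypothetical:

```python
from itertools import product

# Hypothetical parameter grid: finer granularity multiplies the number
# of candidate systems, and with it the odds of a lucky standout.
fast_ma   = range(5, 51)         # 46 values at step 1 (vs. 10 at step 5)
slow_ma   = range(50, 201, 5)    # 31 values
stop_pips = range(10, 101, 10)   # 10 values

n_tests = len(list(product(fast_ma, slow_ma, stop_pips)))
print(f"{n_tests} candidate systems tested on one fixed data set")
# 14,260 tests: even a worthless strategy family will produce a few
# impressive in sample equity curves by chance alone.
```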
2. The data has a regime change right at the end of the in sample test period that causes the out of sample data to behave differently. An instance of a regime change might be a marked change in volatility occurring just at the end of the in sample test period that causes the system to lose money out of sample.
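One rough way to check for this, sketched below, is to compare realized volatility just before and just after the split; the 60-bar window and the returns series are assumptions for illustration:

```python
import pandas as pd

# Rough regime check: ratio of out of sample volatility to volatility
# at the end of the in sample period. Window size is an assumption.
def volatility_shift(returns: pd.Series, split_date: str, window: int = 60) -> float:
    rolling_vol = returns.rolling(window).std()
    in_vol = rolling_vol.loc[:split_date].iloc[-1]   # vol at end of in sample
    out_vol = returns.loc[split_date:].std()         # vol across out of sample
    return out_vol / in_vol

# A ratio far from 1.0 suggests the out of sample period is a different
# volatility regime than the one the system was fit to.
```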
3. Filters that improve in sample system performance may paradoxically decrease performance out of sample. One way to test whether this is the case, sketched in the code below, is to systematically remove the filters one at a time to determine if they are too specific to the current period and not generalizable enough to other times. To test this properly you will need an additional validation period that can be used to evaluate each system change while preserving the out of sample data for the final validation.
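Here is that removal loop as a sketch, assuming a run_backtest(data, filters) hook into your own framework (hypothetical here):

```python
# Filter ablation sketch: drop each filter in turn and compare the
# validation-period result to the all-filters baseline. run_backtest
# is a hypothetical hook returning net profit for a given filter set.
def ablate_filters(run_backtest, data, filters):
    baseline = run_backtest(data, filters)
    return {
        f: run_backtest(data, [x for x in filters if x != f]) - baseline
        for f in filters
    }

# A filter whose removal barely hurts (or even helps) on the separate
# validation period is a candidate for being fit to the in sample data.
```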
4. One of the mathematical calculations is implemented incorrectly. If you use mathematical functions such as standard deviation, correlation, etc., take care to ensure that the calculations are performed correctly by checking them against a known good source for your specific math functions.
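For example, a hand-rolled standard deviation can be checked against NumPy in a few lines; the sample values here are arbitrary:

```python
import numpy as np

# Check a hand-rolled calculation against a reference implementation.
def my_stdev(values):
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

data = [1.2, 0.9, 1.4, 1.1, 0.8]
assert abs(my_stdev(data) - np.std(data)) < 1e-12, "stdev disagrees with NumPy"
# Note: np.std defaults to the population form (ddof=0); silently mixing
# it with the sample form (ddof=1) is exactly the kind of subtle error
# this check catches.
```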
5. Due to a large number of tests, good results were obtained by luck. This is similar to #1. If you torture the data long enough it will eventually relent and provide you with a superior test result. Unfortunately, most of the time this rosy in sample result cannot be replicated out of sample. Determining how much testing is appropriate is more art than science; you will have to experiment to find how long to run your tests and how many iterations to allow for best results.
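A quick simulation makes the point; the counts below are arbitrary:

```python
import random

# 1,000 coin-flip "strategies" over 250 trading days: the best one
# always looks like a winner in sample, despite having no edge at all.
random.seed(42)
best = max(
    sum(random.choice([-1, 1]) for _ in range(250))   # random daily P&L
    for _ in range(1000)
)
print(f"best of 1,000 random systems: {best:+d} units over 250 days")
# Its expected out of sample performance is exactly zero.
```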
6. The data being used in the test is too inconsistent to base a systematic strategy upon. This can happen if you are using low quality data as an input to your system, or if you are asking more from your data than it is able to produce.
7. Your testing method is introducing look ahead or data snooping bias. Data snooping bias occurs when you use information about the future in your system development or testing methodology. It is very easy to introduce this bias if you perform manual backtests, which is one reason I feel it is better to do automated backtesting. Take some time to ensure that no data from your out of sample period leaks into the in sample period.
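The sketch below shows one of the most common automated versions of this bug and its fix; the prices and the 3-bar moving average rule are arbitrary examples:

```python
import pandas as pd

# A signal computed from today's close must not be traded on today's
# close. Shifting the signal one bar ensures only past data drives
# each trade.
closes = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0])
signal = (closes > closes.rolling(3).mean()).astype(int)

biased = closes.pct_change() * signal           # peeks at the same bar it trades
honest = closes.pct_change() * signal.shift(1)  # trades on the next bar

# Only the shifted version replicates what live trading could have done.
```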
8. Your testing methodology is introducing selection bias. Selection bias may be introduced in the in sample testing phase when you select parameters based on best performance. Generally speaking, parameters should be selected based on subsets of the in sample period rather than on the entire in sample period. This is a very difficult bias to overcome, particularly with traditional backtesting software. I may post a future article on this topic.
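One way to act on the subset idea, sketched below, is to score each parameter set on several in sample subsets and keep the one with the best worst-case result; evaluate(param, subset) is a hypothetical hook into your backtester:

```python
# Robust parameter selection sketch: rank by worst-case performance
# across in sample subsets, not by the single best full-period score.
# evaluate(param, subset) is a hypothetical scoring hook.
def robust_select(evaluate, params, subsets):
    scores = {p: [evaluate(p, s) for s in subsets] for p in params}
    return max(scores, key=lambda p: min(scores[p]))

# A parameter set that holds up on every subset is less likely to be
# a fluke than one that wins big on the full period only.
```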
9. The optimization in the in sample period favors highly profitable outcomes. As a result, your loss taking strategy may never be fully tested. An example of this is using a very small profit target with a larger stop loss. Based on the optimization results, you may think the stop criterion rarely triggers, but when new data with different volatility characteristics arrives in the out of sample period, your system ends up with a string of stop losses. Because the stop loss is large, it is rarely triggered in sample, and your optimization process will naturally select systems with a low stop out frequency. But if conditions line up on the unseen data such that several stop losses occur in sequence, you may find that your loss exit parameters are not as optimal as your testing indicated. To overcome this problem, reduce the range of values allowed for adverse loss exits so you get a better feel for how the system handles taking regular losses during the test period.
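A sketch of that constraint, with hypothetical pip values and a run_backtest placeholder for your own framework:

```python
from itertools import product

# Constrain the stop loss search range so the optimizer cannot hide
# behind stops so wide they never fire in sample. Values are
# hypothetical; run_backtest is a placeholder scoring function.
profit_targets = range(20, 101, 10)   # pips
stop_losses    = range(20, 61, 10)    # capped at 60 pips, not 300

def optimize(run_backtest, data):
    return max(
        product(profit_targets, stop_losses),
        key=lambda pt_sl: run_backtest(data, target=pt_sl[0], stop=pt_sl[1]),
    )

# With stops forced into a realistic range, the in sample test must
# take regular losses, so the loss exit logic actually gets exercised.
```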
10. The data has holes in it or is missing periods, which may cause unpredictable signals to be triggered during the test. Always spot check the data before testing, looking for periods where data is missing. During my last round of testing I found a week of EURUSD data missing that was causing unreliable results.
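A simple gap scan like the Python sketch below would have caught that missing week; the one-hour expected spacing assumes hourly bars, so adjust it to your data, and remember that normal weekend closures in FX data will show up as well:

```python
import pandas as pd

# Scan a time-indexed data set for holes before testing. Expected
# spacing of one hour assumes hourly bars; weekend closures in FX
# data will also appear, so compare against a trading calendar.
def find_gaps(index: pd.DatetimeIndex, expected=pd.Timedelta("1h")):
    deltas = index.to_series().diff()
    return deltas[deltas > expected]  # where data resumes after a hole

# gaps = find_gaps(eurusd.index)
# A missing week shows up as a single delta of roughly seven days.
```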