Traders eager to begin trading in a live market frequently make the mistake of relying exclusively on backtesting results to evaluate a system’s potential. Backtesting, which refers to the testing of a trading idea on historical data to verify how a system would have performed during a particular time period, can produce misleading results. It’s important to have a more complete approach to trading system evaluation.
Because backtesting is only part of a proper evaluation process, focusing on backtesting results alone can lead a trader to believe he or she has a rock-star trading system when, in fact, the system may perform poorly in other phases of testing and, eventually, during live trading. Finding positive correlation between backtesting results and other phases of testing, including out-of-sample and forward performance testing, is vital in accurately assessing the viability of a trading system.
Backtesting basics
Backtesting allows traders to apply trading ideas to historical data to see how the system would have performed. Many of today’s trading platforms offer the ability to backtest, and provide efficient and easy-to-use methods of testing ideas on past market data.
Without putting real cash on the line, traders can evaluate the effectiveness of a trading idea with a few simple keystrokes. As long as an idea can be quantified, it can be backtested — from simple moving average crossovers, to complex systems that incorporate multiple trade filters and triggers. "Trading the Russell" (below) shows a strategy that is being tested on the mini Russell 2000 contract.
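To make the idea concrete, here is a minimal sketch of a backtest for a simple moving average crossover. It is an illustration only, not the code of any particular platform: the function names, the long-only rule and the assumption that trades fill at the close of the signal bar are all chosen for this example.

```python
from typing import List, Optional


def sma(prices: List[float], length: int) -> List[Optional[float]]:
    """Simple moving average; None until `length` bars are available."""
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < length:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - length : i + 1]) / length)
    return out


def backtest_crossover(prices: List[float], fast: int, slow: int) -> List[float]:
    """Long while the fast SMA is above the slow SMA, flat otherwise.

    Assumption for this sketch: entries and exits fill at the close of the
    bar that produces the signal, with no slippage or commission.
    Returns the profit or loss of each trade in price points.
    """
    if fast >= slow:
        raise ValueError("fast length must be shorter than slow length")
    fast_ma, slow_ma = sma(prices, fast), sma(prices, slow)
    trades: List[float] = []
    entry: Optional[float] = None
    for i, price in enumerate(prices):
        if slow_ma[i] is None:
            continue  # not enough history yet
        want_long = fast_ma[i] > slow_ma[i]
        if want_long and entry is None:
            entry = price                  # open a long position
        elif not want_long and entry is not None:
            trades.append(price - entry)   # close the position
            entry = None
    if entry is not None:
        trades.append(prices[-1] - entry)  # mark any open trade at the last bar
    return trades


closes = [20, 20, 20, 22, 24, 26, 28, 27, 26, 25]
print(backtest_crossover(closes, fast=2, slow=3))
```

The two lengths are exactly the kind of user-defined inputs that a trader might later tweak or optimize.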

Some trading platforms have strategy "wizards" or "builders" that allow analysts to select from a field of variables to create a custom strategy. Traders can use these strategy-building tools, write their own code (typically using the platform’s proprietary language), or work with a qualified programmer to develop a trading idea into a testable form. Frequently, a trading system will incorporate user-defined input variables, such as the moving average length or the number of standard deviations, which allow the trader to tweak the system, that is, to make small changes to it. These subtle modifications often can lead to dramatic changes in backtesting results.
Optimization studies are another feature that many trading platforms offer in conjunction with backtesting capabilities. Optimization entails entering a range for a specified input — a moving average length, for example — and letting the computer perform the calculations to determine the input that has the best performance. For example, you can optimize a strategy to find the best profit target. You can set the optimization study to test values between $200 and $600 in $20 increments.
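The profit target study described above can be sketched in a few lines. The trade data and the way a target is applied here are assumptions made for illustration: each hypothetical trade is described by its maximum favorable excursion (the largest open profit it reached) and the profit or loss it would have booked with no target in place.

```python
def net_profit(trades, target):
    """Net profit if every trade exits at `target` once its open profit reaches it.

    trades: list of (max_favorable_excursion, pnl_without_target) in dollars.
    Assumption for this sketch: a trade that reaches the target fills exactly there.
    """
    return sum(target if mfe >= target else pnl for mfe, pnl in trades)


def optimize_target(trades, start=200, stop=600, step=20):
    """Test each target from `start` to `stop` inclusive and return the best one."""
    results = {t: net_profit(trades, t) for t in range(start, stop + 1, step)}
    best = max(results, key=results.get)
    return best, results


# Hypothetical trades, for illustration only.
sample_trades = [(650, 100), (420, -150), (300, -200), (250, 50), (900, 400)]
best_target, results = optimize_target(sample_trades)
print(best_target, results[best_target])
```

With this made-up data the best target is $300, yet moving the target one step to $320 lowers the net profit sharply, which is a small-scale view of how sensitive results can be to a single input.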
A multi-variable optimization analyzes two or more variables in conjunction to establish what combination leads to the most favorable results. A multi-variable optimization could determine, for example, which moving average length and relative strength index (RSI) level, when combined, would yield the most favorable results.
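A two-variable study of this kind amounts to scoring every combination of the inputs. The sketch below assumes a scoring function is supplied by the caller; `toy_evaluate` is a hypothetical stand-in for a real backtest and exists only so the example runs.

```python
from itertools import product


def grid_search(evaluate, ma_lengths, rsi_levels):
    """Score every (ma_length, rsi_level) pair and return the best pair."""
    scores = {
        (ma, rsi): evaluate(ma, rsi)
        for ma, rsi in product(ma_lengths, rsi_levels)
    }
    best = max(scores, key=scores.get)
    return best, scores[best], scores


def toy_evaluate(ma_length, rsi_level):
    # Hypothetical stand-in for a backtest result; it peaks at (20, 30).
    return 1000 - 5 * (ma_length - 20) ** 2 - 2 * (rsi_level - 30) ** 2


best, score, scores = grid_search(toy_evaluate, range(10, 51, 10), range(20, 41, 5))
print(best, score, len(scores))
```

The number of combinations grows as the product of the ranges (5 x 5 = 25 here), so each added variable multiplies both the computing time and the opportunity to fit the system to noise.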
Curve-fitting
While it can be exciting to optimize a system and watch the theoretical results improve with a few simple tweaks, over-optimization (also called curve-fitting or over-fitting) is a threat to a trading system’s success in actual trading.
Curve-fitting is the excessive use of optimization to create the most profitable system possible based on the historical data. It could involve the use of too many technical indicators, limiting the days of the week that a system will take trades, restricting the times of day the system can trigger a trade and using other filters and triggers that create the best possible scenario on a given data set by eliminating all or most of the losing trades.
The problem with curve-fitting is that virtually any system can be optimized to show close to 100% profitability (the percentage of winning trades) — on historical data. That’s because traders can manipulate the system to take advantage of every price change, creating a system that is custom-designed only for that particular time period.
Often, this leads to a limited system that performs exceptionally well on the selected historical data but quickly turns into a train wreck on any other data set, including live trading. "Reality bites" (below) shows the equity curve of a strategy that performed well during backtesting, but once forward testing began the results were mediocre, indicating that the strategy was likely curve-fit.

Traders can avoid curve-fitting by paying attention to key performance metrics, such as percent profitable and average trade net profit, and aiming for realistic thresholds. Many profitable trading systems, for instance, are 40% to 60% profitable, nowhere near the 100% that is so enticing to some traders. All trading systems will have losing trades. A percent profitable number that approaches 100% probably indicates that the system has been over-optimized.
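Both metrics are simple to compute from a list of per-trade results. The sketch below uses hypothetical trade figures and counts only trades with a positive result as winners, which is an assumption; some platforms may treat break-even trades differently.

```python
def performance_summary(trade_pnls):
    """Percent profitable and average trade net profit for a list of trade results."""
    if not trade_pnls:
        raise ValueError("no trades to summarize")
    winners = [p for p in trade_pnls if p > 0]  # break-even trades are not winners here
    net = sum(trade_pnls)
    return {
        "trades": len(trade_pnls),
        "percent_profitable": 100.0 * len(winners) / len(trade_pnls),
        "average_trade_net_profit": net / len(trade_pnls),
        "net_profit": net,
    }


# Hypothetical per-trade results in dollars.
pnls = [250, -100, 400, -150, -100, 300, -50, 150, -100, 200]
print(performance_summary(pnls))
```

In this made-up example only half the trades win, yet the system nets $800 because the winners are larger than the losers.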
The rule on optimization is to use broad logical principles. If you find that eliminating long entries on Tuesdays improves your performance, that is probably coincidental, regardless of how much it improves your testing results. You also must set aside clean data to test on after you optimize. Traders also can avoid curve-fitting by continuing to test a system beyond backtesting using in- and out-of-sample data, and by taking advantage of forward performance testing.
Data considerations
When testing a trading idea, the historical data can be divided into two or more segments to provide more reliable results. The data that is used during the initial testing and optimization is called the in-sample data. The second data set, referred to as the out-of-sample data, is a "clean" data set that is not used until the in-sample backtesting and optimizations have been completed. Because the out-of-sample data has not been used in any of the optimizations, traders can apply the optimized system to this reserved historical data to determine if the two data sets provide similar results.
Before backtesting or optimizing begins, the historical data can be divided into two distinct periods to accommodate in-sample and out-of-sample testing. One method is to divide the data into thirds, reserving one-third for out-of-sample testing and using the remaining two-thirds for in-sample testing and optimization. To keep the out-of-sample data clean, only the in-sample data should be used during any optimizations.
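The split itself can be done before any testing starts. This sketch assumes the bars are in chronological order and reserves the most recent third as out-of-sample data; the choice of which third to reserve is an assumption of the example, not a rule.

```python
def split_in_out_of_sample(bars, out_fraction=1 / 3):
    """Split chronologically ordered bars into (in_sample, out_of_sample).

    Assumption for this sketch: the most recent `out_fraction` of the data is
    held back, and the earlier portion is used for testing and optimization.
    """
    out_count = int(round(len(bars) * out_fraction))
    cut = len(bars) - out_count
    return bars[:cut], bars[cut:]


bars = list(range(900))  # stand-in for 900 bars of historical data
in_sample, out_of_sample = split_in_out_of_sample(bars)
print(len(in_sample), len(out_of_sample))
```

Once the split is made, the second list should stay untouched until all optimization on the first list is finished.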
The results of the in- and out-of-sample testing can be evaluated by comparing the performance results or by reviewing the corresponding equity curves of the two data sets. Positive correlation exists where the results are similar, which suggests that the system has promise. Poor correlation, where the out-of-sample results are markedly worse than the in-sample results, indicates that the system may have been curve-fit to match the in-sample data.
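One rough way to compare the two data sets numerically is to look at the average trade in each. The sketch below is only one possible check: the `min_ratio` threshold of 0.5 is an arbitrary assumption for illustration, and the trade figures are hypothetical.

```python
def average_trade(pnls):
    return sum(pnls) / len(pnls)


def compare_samples(in_sample_pnls, out_sample_pnls, min_ratio=0.5):
    """Compare the average trade in the in-sample and out-of-sample results.

    `min_ratio` is a hypothetical threshold: the out-of-sample average trade
    must be positive and at least this fraction of the in-sample average
    for the two sets to be called consistent.
    """
    in_avg = average_trade(in_sample_pnls)
    out_avg = average_trade(out_sample_pnls)
    consistent = in_avg > 0 and out_avg > 0 and out_avg / in_avg >= min_ratio
    return {"in_sample_avg": in_avg, "out_sample_avg": out_avg, "consistent": consistent}


# Hypothetical per-trade results in dollars.
print(compare_samples([200, -100, 300, -100, 200], [150, -100, 200, -50]))
print(compare_samples([200, -100, 300, -100, 200], [-100, 50, -150]))
```

A single number cannot replace a look at the equity curves, but a sharp drop in the average trade on the reserved data is an early warning worth investigating.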
The stronger the correlation between the in- and out-of-sample testing, the higher the probability that the system will do well in forward performance testing and live trading. "Positive promise" (below) illustrates a strategy that has positive correlation between in- and out-of-sample testing, as well as forward performance testing. There is a good probability that this strategy would perform well in live trading.

Forward performance
Forward performance is the next phase of evaluation, and provides traders with an additional set of out-of-sample data on which to test the system. Sometimes called paper trading, forward performance testing is the simulation of actual trading. The system’s logic is applied to a live market, but all trades are executed on paper only — trade entries and exits are recorded, but no real trades are initiated. Many trading platforms boast a simulated environment where traders can complete forward performance testing, as well as practice placing trades and become acquainted with the platform’s user interface.
Again, traders are looking for correlation between the in-sample, out-of-sample and forward performance testing results. The greater the correlation, the more confident the trader can be that he or she has developed a viable trading system.
Even with strong correlation, however, the results of live trading often fall short of any testing results because of slippage and actual fill levels. In testing results, for example, a profitable trade exit may have occurred at the high of a bar before price dropped. In live trading, this same position may turn out to be a loser if the limit order for a profit target never fills because price only briefly touched, then retreated from the profit target level.
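The profit target example can be expressed as a choice between two fill assumptions. In the sketch below, the optimistic rule fills a sell limit order when price merely touches the target, while the conservative rule requires price to trade through the target by one tick. The function name and the tick size of 0.10 are assumptions made for illustration.

```python
def limit_sell_filled(bar_high, target, tick_size=0.10, require_trade_through=True):
    """Decide whether a sell limit order at `target` is treated as filled.

    require_trade_through=False: a touch of the target counts as a fill.
    require_trade_through=True:  price must exceed the target by one tick.
    `tick_size` is a hypothetical value chosen for this example.
    """
    threshold = target + tick_size if require_trade_through else target
    return bar_high >= threshold


# The bar's high only touches a 700.0 target.
print(limit_sell_filled(700.0, 700.0, require_trade_through=False))  # optimistic
print(limit_sell_filled(700.0, 700.0, require_trade_through=True))   # conservative
```

Rerunning a backtest under the conservative rule gives a sense of how many of the system's winning trades depend on fills that might not happen in a live market.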
Traders can review backtesting results further by looking at price charts that have the trade entries and exits marked, taking into consideration any trades that seem to be "on the edge" of what might actually be executed in live trading.
Any trading idea that can be quantified can be tested to determine how well it might perform in a live environment. While backtesting can provide valuable information regarding a system’s potential, backtesting alone often produces deceptive results. The process of evaluating a trading system before risking real money involves backtesting and optimizing on in-sample data, testing on out-of-sample data, and forward performance testing. Positive results and good correlation among the testing phases increase the probability that the system will perform well during live trading.
Jean Folger is the co-founder of, and system researcher with, PowerZone Trading, LLC. Jean can be reached at www.powerzonetrading.com.