The first wave of neural network trading applications formed during the 1990s, then crested and crashed as the Internet stock bubble burst. Neural network applications in trading research dropped off sharply during the first half of the 2000s, and few good research papers were published during this period. Even today, the tools remain relatively out of favor, the exception being support vector machine algorithms, as opposed to the back-propagation and radial basis function networks that were popular during the first wave.
The early approach was to train models over static sets of data and test them over fixed out-of-sample periods. In addition, many of the early neural network models were adapted from time series forecasting methods or classic signal processing, and did not draw on deep domain knowledge of financial markets. Because so many models were developed, some were bound to appear successful purely by chance.
Problems & enablers
Developers of market models face two related problems:
- Stationarity. Many data series are not stationary; they lack a constant mean, and this needs to be addressed with preprocessing.
- Regime change. A set of model inputs works for one regime but not another, so a system or set of inputs might work for six months to a year or more and then fail.
Statistical procedures such as the Augmented Dickey-Fuller and Phillips-Perron tests can detect whether a time series is stationary. If we know a time series is not stationary, we can try to make it so through preprocessing; common methods include taking first and second differences of the series. In addition, current development can use walk-forward testing to see whether model stability is retained on different data sets and represents stable relationships.
Today, sponsors of predictive model development recognize how important domain expertise is and how neural nets are better utilized as only part of the solution.
Here, we will demonstrate this approach by first providing an overview of a classic time series forecasting method, Box-Jenkins, for historical perspective. We will then reuse some of the concepts when we implement our final neural network models. Along the way, we will find that identical terms are used differently in trading and forecasting. To minimize confusion, these terms will be identified.
Among the most popular time series forecasting methods that also have been used for financial time series are Box-Jenkins models, also known as ARIMA models.
Many academic papers discuss the use of Box-Jenkins as a method for market time series forecasting, from papers in the early 1990s on the S&P 500 to relatively recent studies, such as “Comparing the performance of time series models for forecasting exchange rates,” BRAC University Journal, vol. V, no. 2, 2008, 55-65, by M.K. Newaz. The researcher compares various classic time series models, including ARIMA, on the Indian rupee and finds that ARIMA performs well. Newaz also finds that the first difference of the rupee series, not the series itself, is stationary.
ARIMA stands for Autoregressive Integrated Moving Average. The “integrated” indicates that the time series is transformed into a stationary series, and the “auto” means the transformed series is regressed on itself. ARIMA encompasses three different types of models:
- AR, autoregressive
- MA, moving average
- ARMA, both AR and MA in the same model
An AR model is like a simple linear regression model, except that the independent variables are time-lagged versions of the dependent variable, the time series itself; thus, it is autoregressive. An autoregressive model can have multiple terms and be either linear or nonlinear.
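To make the autoregressive idea concrete, here is a minimal sketch in Python: we simulate an AR(1) process, x(t) = c + phi*x(t-1) + noise, and recover the coefficient phi by ordinary least squares on the lagged series (the coefficient values and series length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
phi_true, c, n = 0.6, 0.5, 2000

# Simulate an AR(1) series: each value depends on the previous one plus noise.
x = np.zeros(n)
for t in range(1, n):
    x[t] = c + phi_true * x[t - 1] + rng.normal(0, 1)

# Regress x(t) on x(t-1): the "independent variable" is the lagged series itself.
X = np.column_stack([np.ones(n - 1), x[:-1]])
coef, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
c_hat, phi_hat = coef
print(f"estimated phi = {phi_hat:.3f}")
```

With 2,000 observations, the estimated coefficient lands close to the true value of 0.6.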
An MA model is a weighted moving average of a fixed number of previously produced forecast errors. Traders expect a moving average to be of the series itself, such as an average of closing prices, but the average in this case is of forecast errors. The term means the same thing conceptually in trading and forecasting; only the series being averaged differs. The average error is used to correct the error of the autoregressive model.
Box-Jenkins models are univariate, based on a single time series. The model establishes a relationship between present and past values of the series so the past values can then be used in forecasting. These models require stationary series. Even though Box-Jenkins has been used in many studies to forecast market data, the method is not totally suitable because market data are not stationary and do not have a normal distribution.
Developing a Box-Jenkins model requires four steps:
- Model identification
- Model estimation
- Diagnostic checking
- Forecasting
The last three steps are similar to those for linear regression, such as the use of the Pearson correlation coefficient and the t-statistic, so we omit them here on the assumption that they are familiar to most readers (and available for review in any basic statistics text, if necessary). The first step in developing a Box-Jenkins model, however, requires the judicious use of discretion based on domain expertise.
To develop a model, we must identify the proper form — AR, MA or ARMA — and how many terms are needed. We answer these questions with two functions of the series: the autocorrelation function (ACF) and the partial autocorrelation function (PACF).
ACF and PACF are like classic correlation functions, with values from -1.00 to 1.00 if the time series is stationary. In an exception to the classic definition, the PACF uses lagged values of the time series itself as the independent variable.
When the regression includes only one independent variable of one-period lag, the coefficient of the independent variable is called a first-order partial autocorrelation function. If a second term of two-period lag is added, the coefficient of the second term is called a second-order partial autocorrelation function, and so on.
To identify an appropriate model, we plot the ACF and PACF in a correlogram for a good visual indication of our model. The pattern in the ACF and the number of spikes in the PACF tell us how many terms we need. If the ACF lag correlation fades quickly and the PACF has only one spike, then we use a first-order AR model.
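A sketch of this identification step in Python: we compute the sample ACF at the first two lags and the second-order PACF, using the two-lag regression definition given above, for a simulated AR(1) series (the coefficient 0.7 and series length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, phi = 5000, 0.7

# Simulate an AR(1) series -- its ACF should decay geometrically
# (phi, phi^2, ...) while its PACF should show a single spike at lag 1.
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal(0, 1)

def acf(series, lag):
    """Sample autocorrelation of the series with itself at a given lag."""
    s = series - series.mean()
    return np.dot(s[:-lag], s[lag:]) / np.dot(s, s)

acf1, acf2 = acf(x, 1), acf(x, 2)

# Second-order PACF: the coefficient of the two-period lag in a regression
# that also includes the one-period lag.
X = np.column_stack([np.ones(n - 2), x[1:-1], x[:-2]])
pacf2 = np.linalg.lstsq(X, x[2:], rcond=None)[0][2]

print(f"ACF(1)={acf1:.2f}  ACF(2)={acf2:.2f}  PACF(2)={pacf2:.2f}")
```

The fading ACF combined with a near-zero second-order PACF is exactly the correlogram signature that points to a first-order AR model.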
One reason that Box-Jenkins is important is that early work for predicting market data used neural networks in the methodological place of Box-Jenkins. Comparisons showed neural network models, like back-propagation or kernel regression, performed as well or better than Box-Jenkins. Box-Jenkins uses only the price data itself, but neural networks can include truly independent variables, such as intermarket relationships, fundamental data, etc. The ability to use truly independent series makes neural networks and kernel regression more powerful for forecasting market data.
The computer speed and software available today expand how we can use neural networks and kernel regression. We can use them for models that trade at the portfolio level because software like TradersStudio offers neural network technology within a strong portfolio-based trading platform. Increases in computer speed make these solutions feasible. TradersStudio also offers advanced handling for splits and dividends of equities. As a result we can, using neural networks, trade stocks, exchange-traded funds and mutual funds using baskets of instruments, and can trade futures portfolio systems, in addition to developing single-market systems such as those for the S&P 500.
Market data effects
What we trade largely affects how we preprocess the data. Stated generally, the kind of data we use greatly affects the design of the preprocessing for our models and the outputs that we predict.
Many classic preprocessing methods use percentage changes to normalize price changes, but if we use back-adjusted futures data, we cannot use ratios. This creates a problem: Back-adjusting destroys price levels by adjusting out contract roll gaps. Range, however, is not distorted in back-adjusted contracts, so if we use range to normalize attributes, such as (Close-Close[1])/Average(Range,10), we’re OK.
If we want to use more classic ratios, we need to use ratio-adjusted futures contracts, cash series or index data. If we are using data for ETFs, mutual funds or stocks, then split-adjusted data work fine because split-adjusted data are logically analogous to ratio-adjusted futures contracts.
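A sketch of the range-based normalization described above, in Python with NumPy; the 10-bar window follows the Average(range,10) example, and the bar data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300

# Synthetic back-adjusted futures bars: price levels are arbitrary (and can
# even go negative after back-adjustment), which is why percentage
# normalization fails, but bar range (high - low) is never distorted by rolls.
close = 50 + np.cumsum(rng.normal(0, 1.5, n))
bar_range = np.abs(rng.normal(2.0, 0.5, n))

# (Close - prior Close) / Average(range, 10): a roll-safe normalized momentum.
# The 10-bar average range is aligned so each change is divided by the
# average range of the window ending on the same bar.
avg_range = np.convolve(bar_range, np.ones(10) / 10, mode="valid")
momentum = np.diff(close)[8:] / avg_range

print(momentum[:5])
```

Because the numerator and denominator are both level-free, the result stays meaningful across contract rolls, unlike a percentage change.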
Inputs such as unemployment numbers and inflation gauges require longer bar time frames (for example, monthly). We need to normalize the release date so that monthly data line up correctly. The easiest way is to use the last trading day of the month regardless of when in the month the data actually were released. If we are using daily or weekly data, we can use the actual release date to predict shorter-term trends.
Noise also affects preprocessing. We need to design preprocessing for a monthly or weekly model differently than what we would do for daily or intraday data. For example, price changes might readily be predictable on a weekly time frame, hardly predictable on a daily time frame, and unfathomable with intraday bars. You need to develop different types of targets based on the time frame being predicted.
The types of markets that we trade also affect our predictions. Currencies trend, stock indexes revert to a mean, and other markets are choppy and noisy.
Input & output training
Preprocessing simplifies the relationships between the inputs and our desired target. The goal of preprocessing is to make these relationships clear enough that the neural network or kernel regression algorithm can model them. This is not as easy as it sounds for neural nets. Not only do we need preprocessing that makes these relationships clear, but the distribution of outcomes affects how well the neural network will train. When we use these algorithms, we need to scale the data before passing it to the algorithm.
Let’s look at back-propagation. Its two most common activation functions are the sigmoid, which scales from 0 to 1, and the hyperbolic tangent, which scales from -1 to 1.
- Sigmoid: S(t) = 1 / (1 + e^(-t))
- Hyperbolic tangent: tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Saturation & clipping
When developing preprocessing for networks with either type of activation function, we have a problem if the bulk of the examples fall beyond the saturation point. Both functions saturate: the hyperbolic tangent at input values of roughly +/- 2.5 and the sigmoid at roughly +/- 5.0, so it’s important to look at the scaling when we are developing preprocessing. See “Logistic curve” and “Hyperbolic tangent curve” (below). Saturation easily can occur because market data do not normally have a zero mean. We want 90%-95% of the data to fall in the linear part of these functions because learning does not take place for values in the saturation range. For both activation functions, input values inside the threshold produce a response roughly proportional to the input value; beyond it, the output response is clipped: between 0 and 1 for the sigmoid and between -1 and 1 for the hyperbolic tangent.
As a result, if financial data are used without preprocessing and have an average offset and volatility such that most of the time series is outside the clipping range, the input neuron is pinned at its maximum or minimum value, and the data serve no purpose in training the network. The output data from a back-propagation network are obtained from the output of these simulated neurons as well and, as such, are limited in range. If the values the network is being trained to produce lie outside the clipping range, the network never accurately reproduces results, and the best training is likely to be a random choice among equally bad networks.
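The saturation behavior is easy to verify numerically; a small sketch (the thresholds are approximate):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# In the linear region, small input changes still move the output...
print(np.tanh(0.5) - np.tanh(0.4))
print(sigmoid(0.5) - sigmoid(0.4))

# ...but past the saturation points (~+/-2.5 for tanh, ~+/-5 for sigmoid)
# the output is effectively pinned and carries no training information.
print(np.tanh(2.5))                  # ~0.987
print(sigmoid(5.0))                  # ~0.993
print(np.tanh(6.0) - np.tanh(5.0))   # nearly zero
```

An unscaled price series with, say, a mean of 1,200 would pin every input neuron at its maximum, which is exactly the failure mode described above.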
The solution to these difficulties is to transform, or scale, the data. The two-part purpose of the data transformation is:
- For all the input values used to train the network, the data are in the linear portion of the activation function, where changes in the input values produce meaningful changes in the input neuron response.
- Output training data lie in the range allowed by the type of output neurons, so the network accurately matches the unscaled training data.
In summary, the forecast process is:
- Start with real-world data.
- Scale that data to the clipping range of the activation function.
- The model produces scaled outputs.
- “De-scale” the outputs to real-world forecast data.
For simplicity, scaling is done linearly by removing a constant offset from the original price data and then multiplying the result by a scale factor so the volatility of the scaled data is within the allowed clipping range for the activation function.
One way to choose the scaling factor is to determine the maximum and minimum values of the input data and to choose the offset and scale factor so that the minimum–maximum range exactly fits the input range for the input training data and the neuron output range for the output training data.
A drawback of this type of scaling is that it can be inordinately influenced by a small number of outlier events, in which the financial data exhibit large swings that are better ignored by the network.
To address the problem of outlier events, an improved scaling method is as follows:
- Evaluate the statistical mean and standard deviation for the input and output training data.
- Use the calculated average as the offset to remove from the data.
- Scale the data so that the standard deviation fits the allowed input and output ranges of the network.
Once a network is trained with scaled data, it is important, when the network is used as a predictor, to apply the same scaling to the input data and to apply the inverse of the output scaling to the network’s prediction value before using it in trading.
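The mean/standard-deviation scaling and its inverse might be sketched as follows; the divisor of 2.0, which keeps roughly two standard deviations inside the linear range, is an illustrative choice, as is the synthetic price series:

```python
import numpy as np

class ZScoreScaler:
    """Remove the mean as the offset, scale by a multiple of the standard
    deviation, and invert the transform on the way back out."""

    def __init__(self, k=2.0):
        self.k = k

    def fit(self, data):
        self.offset = data.mean()
        self.scale = self.k * data.std()
        return self

    def transform(self, data):
        return (data - self.offset) / self.scale

    def inverse_transform(self, scaled):
        return scaled * self.scale + self.offset

rng = np.random.default_rng(3)
prices = 1200 + np.cumsum(rng.normal(0, 5, 400))   # far from a zero mean

scaler = ZScoreScaler().fit(prices)
scaled = scaler.transform(prices)          # centered on 0, std of 0.5
restored = scaler.inverse_transform(scaled)  # back to real-world prices
print(scaled.mean(), scaled.std())
```

The same fitted scaler must be reused at prediction time; refitting on new data would silently change the offset and scale the network was trained against.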
The radial basis function algorithms and the kernel regression methods require the input and output training data to be scaled. The best practice is to transform each input data feature, and the output data being modeled, so that the final scaled data are centered on zero and have fluctuations of approximately unit magnitude. Choices that the analyst makes to meet this requirement determine how important the outlier financial values are to training the network algorithm. Domain expertise is crucial to the correct choices.
Support vector machines
Many artificial intelligence solutions like classic neural networks, whether back-propagation networks or radial nets, begin with random weights at the start of training. We then keep retraining from different starting points. Genetic algorithms begin the same way, with a random population.
In the hands of a skilled analyst, these approaches can work well, but they create a problem: It’s difficult to duplicate results. Training results do not match over multiple training sessions, defeating the scientific method and making traders uncomfortable.
As a solution built on linear algebra, kernel regression enables duplication of outputs. For the same input parameters and preprocessed data, it always produces the same results. But kernel regression typically loses robustness quickly with too many inputs. A good rule of thumb is to use fewer than 15 inputs, preferably fewer than 10. With smart design and domain expertise, a limited number of inputs is a strength, not a weakness, in system development.
Kernel regression is a support vector machine (SVM) algorithm. An SVM constructs an n-dimensional space that separates the data into different classifications. SVM models are closely related to neural network models: A sigmoid kernel function is equivalent to what can be done with a two-layer feed-forward neural network.
Using a kernel function, SVMs provide an alternative training method for polynomial, radial basis function and multi-layer perceptron classifiers. Rather than solving a non-convex, unconstrained minimization problem, as in standard neural network training, the weights of the network are found by solving a quadratic programming problem with linear constraints.
A two-dimensional example helps to visualize the SVM concept. Assume that our data have a target variable with two categories, and assume two predictor variables with continuous values. If we plot the data points using the value of one predictor on the x-axis and the other on the y-axis, we might produce a recognizable shape. Rectangles represent one category of the target variable, while the other category is represented by ovals. In this idealized example, the cases with one category are in the lower left corner. Cases with the other category are in the upper right corner. The cases are completely separated.
SVM analysis attempts to find a one-dimensional hyperplane, the line separating the cases based on their target categories. An infinite number of possible lines to separate the categories exists. Two candidate lines appear in “Hyperplane alternatives” (below). We need to determine which line is a better divider.
SVMs are described as having attributes. A transformed attribute that is used to define a hyperplane is called a feature. A set of features that describes one case, a row of predictor values, is called a vector.
The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors so that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors that are near the hyperplane are the support vectors.
The parallel dashed lines mark the distance between the dividing line and the vectors closest to the line. The distance between the dashed lines is called the margin. The vectors (points) that constrain the width of the margin are the support vectors. An SVM analysis orients the line to maximize the margin between the support vectors. In “Hyperplane alternatives” the line in the right panel is superior to the line in the left panel.
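A sketch of this idea with scikit-learn's SVC, where two synthetic clusters play the role of the rectangles and ovals in the two-dimensional example (cluster locations and sizes are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# Two separable clusters: one category in the lower left corner,
# the other in the upper right.
lower_left = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
upper_right = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X = np.vstack([lower_left, upper_right])
y = np.array([0] * 50 + [1] * 50)

# A linear kernel fits the maximum-margin separating line directly.
model = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the vectors nearest the dividing line constrain the margin.
print("support vectors:", len(model.support_vectors_))
print("training accuracy:", model.score(X, y))
```

Note that only a handful of the 100 points end up as support vectors; the rest of the data could move around freely without changing the fitted line.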
When good lines go bad
If all analyses consisted of two-category target variables with two-predictor variables and clusters of points that could be divided by a straight line, life would be easy. Unfortunately, this is not the case.
What if your data points are nonlinear and, thus, cannot be separated with a straight line? Rather than fitting nonlinear curves to the data, an SVM model separates the data by using a kernel function to map it into a different space where a hyperplane can separate it.
The concept of a kernel mapping function is powerful. It enables SVM models to separate data with complex boundaries. Visually, the kernel regression method finds a surface in a higher-dimensional space that passes as close as possible to as much of the input data as possible.
Many kernel mapping functions can be used — probably an infinite number. However, few kernel functions work well in a wide variety of applications. The default and recommended kernel function is the radial basis function.
RBF uses two variables conventionally labeled as C (cost) and Γ (for curvature, pronounced as the Greek gamma). The Γ parameter determines how curved or convoluted the surface is allowed to be. A value of zero for Γ corresponds to a flat or planar surface, while higher (positive) Γ values allow the surface to make tighter bends in the higher dimensional space. When the input data are scaled to have typical fluctuations in the -1 to +1 range, the best Γ values are in the range 0 to 1.
The C parameter determines how closely the kernel-regressed surface attempts to match the training set data. A large value of C forces the algorithm to attempt to fit every training point. A smaller value allows the surface to miss some outliers, possibly improving the overall predictive value of the surface.
The analyst does not know beforehand which C or Γ is the best solution to the problem. Determining the best values of C and Γ is an empirical question to be answered by fitting the kernel regression for a range of parameters and by choosing the best set of C and Γ by evaluating a quality-of-fit factor for each set.
The goal is to identify a good C, Γ pair so the classifier can accurately predict unknown data, that is, data outside the training set. A common approach is to separate the training data into two parts, one of which is treated as unknown when training the classifier. The prediction accuracy on this held-out set more precisely suggests the performance on classifying unknown data.
An improved version of this procedure is known as cross-validation, which helps prevent over-fitting. In v-fold cross-validation, the analyst divides the training set into v subsets (folds) of equal size. Sequentially, each fold is tested using the classifier trained on the remaining v-1 folds. Every instance of the whole training set is thus predicted once, and the cross-validation accuracy can be expressed with the mean squared error.
A grid search determines C and Γ using cross-validation: Pairs of C and Γ are tried, and the pair with the best cross-validation accuracy is selected. Trying exponentially growing sequences of C and Γ is a practical way to identify good parameters.
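This grid search over exponentially spaced C and Γ values can be sketched with scikit-learn's GridSearchCV; the grid bounds, the 5-fold split, and the synthetic two-class data standing in for scaled market features are all illustrative choices:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(9)

# Synthetic two-class problem standing in for scaled market features.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Exponentially growing sequences of C and gamma, evaluated by
# v-fold (here 5-fold) cross-validation.
param_grid = {
    "C": [2.0 ** k for k in range(-3, 6, 2)],
    "gamma": [2.0 ** k for k in range(-7, 2, 2)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validation accuracy:", round(search.best_score_, 3))
```

Each candidate pair is scored on held-out folds only, so the selected pair reflects predictive accuracy rather than in-sample fit.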
Developing trading systems
The key to developing reliable and intelligent trading strategies using neural network or kernel regression is to integrate these technologies with already profitable trading strategies, rather than to expect the neural or kernel model to be everything.
Efforts from the 1990s used dual- or triple-moving-average crossovers. We now can use intelligent neural or kernel technology to predict such crossovers early; it’s often possible to decrease lag by predicting up to 30% of the period of the shorter moving average. We want to find robust moving-average parameters that degrade only slightly when signals are a few bars late, and then capture bigger profits by predicting the crossover a few bars early. When we have this type of modeling technology, predicting the crossovers can be effective.
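The crossover events such a model would try to anticipate can be computed directly; a minimal sketch in Python, where the 10/40 lengths and the synthetic price series are illustrative:

```python
import numpy as np

rng = np.random.default_rng(11)
close = 100 + np.cumsum(rng.normal(0.05, 1.0, 500))

def sma(series, length):
    """Simple moving average, aligned to the bar where each window ends."""
    return np.convolve(series, np.ones(length) / length, mode="valid")

fast_len, slow_len = 10, 40
fast = sma(close, fast_len)[slow_len - fast_len:]  # align both to slow window
slow = sma(close, slow_len)

# Crossover state: +1 while the fast average is above the slow, -1 below.
signal = np.where(fast > slow, 1, -1)

# Bars where the state flips are the crossovers a predictive model
# would try to call a few bars early to reduce lag.
crossovers = np.flatnonzero(np.diff(signal) != 0) + 1
print("number of crossovers:", len(crossovers))
```

A neural or kernel model would be trained to forecast the fast-minus-slow spread a few bars ahead, turning these lagging flips into anticipatory signals.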
Another trend-following strategy is to predict price highs or lows within a window. If the market breaks out of that window, you take the trade as a neurally assisted channel breakout.
An interesting strategy is to look at both positive and negative price excursion, buying when positive excursion is double the negative excursion and selling when negative excursion is double the positive excursion.
Or, assume we want to develop a stock trading model. We can combine fundamental and technical analysis with a target, such as relative strength of a given stock to the S&P 500, and use the combination as a stock-selection method.
Yet another strategy is to predict when the technical model fails: For example, to predict when not to follow the signals generated by an intermarket model, or when to exit a trade of an intermarket system.
In our final article in this series, we will develop one or more case studies of a real trading strategy using neural networks or kernel regression.