Neural networks are fickle tools.
You will never get exactly the same result from two different initialization and training runs. This is one reason artificial intelligence (a.i.) trading has stagnated since the mid-to-late 1990s. Most end users cannot accept the peculiarities of the popular tools, and demand for mainstream innovation has stalled.
But innovation hasn’t stopped on all fronts. Many ideas have advanced trading a.i. into new frontiers. One is kernel regression, a supervised modeling method like backpropagation-trained neural networks, except that it does not start from random initial conditions. This means you don’t need to retrain a model on the same input/output data 10 times to deal with the initial-condition problem.
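To see the difference in practice, here is a minimal sketch (not from the article) that fits a kernel regression and a small neural network twice on the same synthetic data. The data, the scikit-learn library and the parameter choices are assumptions made only for the illustration:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Kernel regression: two fits on the same data give identical predictions.
k1 = KernelRidge(kernel="rbf", alpha=1.0).fit(X, y).predict(X)
k2 = KernelRidge(kernel="rbf", alpha=1.0).fit(X, y).predict(X)
print("kernel fits identical:", np.allclose(k1, k2))   # True

# Neural network: different random initializations give different models.
n1 = MLPRegressor(hidden_layer_sizes=(10,), random_state=1, max_iter=500).fit(X, y).predict(X)
n2 = MLPRegressor(hidden_layer_sizes=(10,), random_state=2, max_iter=500).fit(X, y).predict(X)
print("network fits identical:", np.allclose(n1, n2))  # typically False
```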
Kernel regression does have its own issues. Typically, it loses robustness quickly with too many inputs. A good rule of thumb is to stay below 15 inputs, and preferably 10; otherwise, expect problems. However, with smart design and domain expertise, this limited number of inputs is not an issue.
Kernel regression belongs to the same family of kernel-based methods as the support vector machine (SVM). This type of algorithm constructs a hyperplane in an n-dimensional space that separates the data into different classifications. SVM models are closely related to neural network models; an SVM with a sigmoid kernel function is equivalent to a two-layer feed-forward neural network.
Using a kernel function, SVMs are an alternative training method for polynomial, radial-basis function and multi-layer perceptron classifiers, in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training.
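The kernels named above can be tried directly with an off-the-shelf SVM implementation, which solves the quadratic programming problem internally. The sketch below, with made-up data and default parameters, is only an assumption of how that might look using scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data standing in for real market inputs.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The same SVC solver handles polynomial, radial-basis and sigmoid kernels.
for kernel in ("poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, "support vectors:", clf.n_support_.sum(), "accuracy:", clf.score(X, y))
```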
In SVM terminology, a predictor variable is called an attribute, and a transformed attribute used to define the hyperplane is called a feature. Choosing the most suitable representation is known as feature selection. A set of features that describes one case (that is, a row of predictor values) is called a vector. The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side. The vectors near the hyperplane are the support vectors.
A two-dimensional example helps to visualize the concept. Assume that our data have a target variable with two categories. Also, assume that there are two predictor variables with continuous values. If we plot the data points using the value of one predictor on the X-axis and the other on the Y-axis, we might end up with an image such as the one shown in “Cuts two ways” (right). One category of the target variable is represented by rectangles, while the other category is represented by ovals.
In this idealized example, the cases with one category are in the lower left corner, while the cases with the other category are in the upper right corner; the cases are completely separated. The SVM analysis attempts to find a one-dimensional hyperplane (a line) that separates the cases based on their target categories. An infinite number of lines could separate the categories; two candidate lines are shown in the example. We need to determine which line is the better divider.
The parallel dashed lines mark the distance between the dividing line and the closest vectors to the line. The distance between the dashed lines is called the margin. The vectors (points) that constrain the width of the margin are the support vectors. An SVM analysis finds the line that is oriented to maximize the margin between the support vectors. In the figure, the line in the right panel is superior to the line in the left panel.
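A quick sketch of this two-dimensional case, with invented cluster locations, shows how a linear SVM recovers the maximum-margin line and reports the support vectors (scikit-learn is assumed here):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
ovals = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))     # lower-left cluster
rectangles = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))  # upper-right cluster
X = np.vstack([ovals, rectangles])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin
w = clf.coef_[0]
print("number of support vectors:", len(clf.support_vectors_))
print("margin width (distance between the dashed lines):", 2.0 / np.linalg.norm(w))
```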
If all analyses consisted of two-category target variables with two predictor variables and clusters of points that could be divided by a straight line, life would be easy. Unfortunately, this is not generally the case.
WHEN LINES GO CROOKED
What if your data points cannot be separated with a straight line? In other words, they are nonlinear. Rather than fitting nonlinear curves to the data, SVM separates the data by using a kernel function to map the data into a different space where a hyperplane can be used to do the separation (see “Different view,” page 40).
The concept of a kernel mapping function is powerful. It allows the SVM models to perform separations with very complex boundaries.
Many kernel mapping functions can be used — probably an infinite number. However, few kernel functions have been found to work well in a wide variety of applications. The default and recommended kernel function is the radial basis function (RBF).
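To make the mapping idea concrete, the sketch below, assuming scikit-learn and an artificial ring-shaped data set, shows a linear kernel failing on data that no straight line can split while an RBF kernel separates it cleanly:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class forms a ring around the other, so no straight line separates them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)
print("linear kernel accuracy:", linear.score(X, y))  # roughly chance
print("RBF kernel accuracy:", rbf.score(X, y))        # near 1.0
```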
There are two parameters in an RBF kernel: C and gamma. It is not known beforehand which C and gamma are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify a good (C, gamma) pair so that the classifier can accurately predict unknown data (that is, data not used in training). A common approach is to separate the training data into two parts, one of which is treated as unknown while training the classifier. The prediction accuracy on this held-out set then more closely reflects the performance on unseen data. An improved version of this procedure is known as cross-validation.
In v-fold cross-validation, the training set is divided into v subsets of equal size. Sequentially, each subset is tested using the classifier trained on the remaining v-1 subsets. Every instance of the training set is thus predicted once, and the average performance across the folds is the cross-validation accuracy (for a regression model, the mean squared error). The cross-validation procedure helps prevent the over-fitting problem discussed earlier.
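A minimal sketch of v-fold cross-validation, assuming scikit-learn, synthetic data and v = 5:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each of the 5 folds is scored by a classifier trained on the other 4.
scores = cross_val_score(SVC(kernel="rbf", C=1.0, gamma=0.1), X, y, cv=5)
print("per-fold accuracy:", scores)
print("cross-validation accuracy:", scores.mean())
```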
The grid search determines C and gamma using cross-validation. Pairs of (C, gamma) are tried, and the pair with the best cross-validation accuracy is picked. Trying exponentially growing sequences of C and gamma has been found to be a practical method for identifying good parameters.
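The sketch below illustrates such a grid search; the specific exponent ranges for C and gamma are assumptions, chosen only to show the exponentially growing sequences:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Exponentially growing sequences of C and gamma (ranges assumed for illustration).
param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),       # 2^-5, 2^-3, ..., 2^15
    "gamma": 2.0 ** np.arange(-15, 4, 2),   # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("best (C, gamma):", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)
```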
BREAKING THE MOLD
Although the computing power exists today to create incredibly complex neural networks trained over massive sets of data, the truth is neural networks are like any trading system: the more complex, the more likely they won’t work in the future.
When testing inputs in preprocessing, we need to sample data. For example, if we use ADX (average directional movement index) in a model, we might want to sample ADX by using the current value, the value two bars ago, five bars ago, 10 bars ago and 20 bars ago. That is five inputs from a single indicator, so you can see how networks easily end up with 30, 40 or 50 inputs.
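A small sketch of that lag sampling, using a placeholder series in place of a real ADX calculation (pandas assumed), shows how quickly the input count grows:

```python
import pandas as pd

adx = pd.Series(range(100), name="adx")   # placeholder for a real ADX series
lags = [0, 2, 5, 10, 20]

# One column per sampled lag: current value, 2 bars ago, 5, 10 and 20 bars ago.
inputs = pd.concat({f"adx_lag_{k}": adx.shift(k) for k in lags}, axis=1)
print(inputs.shape[1], "inputs from a single indicator")
```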
Remember, if we want to use deterministic methods like kernel regression, this is a problem. Classic old-school development uses this type of approach: if we wanted to predict 10 days out, we might have seven inputs for each variable and seven or eight variables, which means 50 or more inputs to our models. Kernel regression can’t use this paradigm. Never mind that any analyst would be hard-pressed to make sense of all the relationships in such a complex model.
So, if our goals are to use deterministic methods such as kernel regression, have models that we understand, and are robust, what is the solution?
It’s simple. We build smaller models, composed of logical components that themselves are predictive. We will test different combinations of these and build multiple models. Then, we will combine these models to create a composite model. Each model, or expert component, will look at the problem slightly differently because components will vary.
We can build these expert components manually, much as we build trading systems, and have them output 1, 0 or -1, for example. We also could evolve them using genetic algorithms or another machine learning method. We also could create fuzzy outputs, such as an indicator measuring intermarket signal strength that is based not only on direction but also on the time since divergence first occurred. If all goes as planned, the experts will look at the problem differently and increase the robustness of our solution.
Building these components requires domain expertise. They can be built manually. We also could use genetic algorithms to evolve them. We can use forecasting methods such as linear regression forecasts and employ a genetic algorithm to predict the error in these linear forecasts.
In addition, we can use descendants of older algorithms. We don’t always need our components to give buy/sell outputs. They can be forecasts; they also can be advanced composite indicators that tell us about trend strength and market modes. What we are doing is allowing neural networks or kernel regression to combine components into an expert component.
When we build these components, as well as more advanced parts of this solution, we will use them in a walk-forward way. This means that sometimes a component will stop working and we need a way of replacing it.
Now let’s define the terminology we will be using:
1. Expert component: This combines two or more inputs into one output with a gain of information relating to that output. Components are designed to extract knowledge from the inputs. This can be as simple as a technical indicator or several indicators combined to produce a single output. It also can be used to produce simple forecasts.
2. Knowledge block: A model created by applying multiple expert components. These models combine experts using methods such as neural networks, kernel regression and solutions evolved with genetic algorithms.
3. Collaboration block: A combination of multiple knowledge blocks that use different expert components and look at the solution slightly differently. These blocks can be combined with a voting scheme or trained using neural network/machine learning methods.
4. Meta-goal: A combination of collaboration blocks that creates an end trading solution with buy/sell signals.
All of these components need to work in a walk-forward manner. The expert components might be long-term relationships and not based on walk-forward analysis, but the other levels need to be trained and created in a walk-forward manner. It’s possible that the knowledge block will not train correctly in some walk-forward windows. This means that collaboration blocks also need to be changed, sometimes as we walk forward.
We work around these issues with the following mechanisms:
1. Component supervisor: Controls which expert components are still valid in the current walk-forward window. Sometimes this can be omitted for long-term relationships that are based on long-term fixed parameters.
2. Knowledge block supervisor: Selects components to use based on component supervisor screening.
3. Collaboration block supervisor: Evaluates these trained knowledge blocks and decides which ones to use in collaboration blocks.
This new paradigm allows us to break a solution into multiple pieces that work together and can adapt to the markets as a unit.
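A bare structural sketch of this hierarchy, with invented names standing in for real trained models, might look like the following; the simple average and vote used here are placeholders for the trained combiners described above:

```python
from typing import Callable, List

# An expert component maps a bar's inputs to one informative output (e.g., +1/0/-1).
ExpertComponent = Callable[[dict], float]

def knowledge_block(experts: List[ExpertComponent], bar: dict) -> float:
    """Combine expert outputs into one model output. A trained combiner
    (kernel regression, a neural network or an evolved rule set) would
    replace this simple average."""
    return sum(expert(bar) for expert in experts) / len(experts)

def collaboration_block(knowledge_outputs: List[float]) -> float:
    """Combine several knowledge blocks; here a plain voting scheme."""
    return float(sum(1 if o > 0 else -1 if o < 0 else 0 for o in knowledge_outputs))

def meta_goal(collaboration_outputs: List[float]) -> int:
    """Turn collaboration-block outputs into the final buy/sell/flat signal."""
    total = sum(collaboration_outputs)
    return 1 if total > 0 else -1 if total < 0 else 0
```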
Consider the following variables that could be used to build expert components for an S&P 500 system based on the relationship between the 30-year Treasury and the S&P 500:
• Intermarket indicators of divergences
• Correlation between intermarkets
• Long-term trend
• Intermediate-term trend
• Predictive correlation
• Trend indicators
We would take two to three versions of these concepts and use them as inputs to kernel regression. Because we need to keep the number of inputs low, we can’t do much sampling for the model. The sampling must be done at this expert component level. This means we need to develop an input which represents what we want.
This model represents the classic intermarket relationship for the S&P 500:
• If ((Close of SP500) > (Average Close of SP500)) And ((Close of T-Bonds) < (Average Close of T-Bonds)) Then Sell SP500
• If ((Close of SP500) < (Average Close of SP500)) And ((Close of T-Bonds) > (Average Close of T-Bonds)) Then Buy SP500
For example, we can use this logic to build a component that outputs a 1 when the buy signal is generated and a -1 when we get a sell. These systems are always in the market as a stop-and-reverse strategy.
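A hedged sketch of that component, with an assumed 20-bar moving-average length and pandas as the toolkit (the function name and inputs are illustrative, not the author’s code):

```python
import numpy as np
import pandas as pd

def intermarket_divergence(sp500: pd.Series, tbonds: pd.Series, length: int = 20) -> pd.Series:
    """+1 where the buy rule fires, -1 where the sell rule fires; the last
    signal is carried forward so the component is always in the market."""
    sp_avg = sp500.rolling(length).mean()
    tb_avg = tbonds.rolling(length).mean()
    raw = np.where((sp500 < sp_avg) & (tbonds > tb_avg), 1.0,
          np.where((sp500 > sp_avg) & (tbonds < tb_avg), -1.0, np.nan))
    return pd.Series(raw, index=sp500.index).ffill().fillna(0.0)
```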
There also would be a time element. That is, when a signal is generated opposite the previous signal while another model has not yet reversed, we could build one or two components that express this divergence as well as its current mode. These components could use fuzzy logic to create a single output based on the divergence, the time since it first occurred and the mode of the current divergence.
We also could take three sets of moving averages that work well but are far enough apart that they don’t produce the same results. We then simply add the outputs to produce a final intermarket divergence number, which could be used as the component. We could optimize which lengths to use by building a system around these rules, or use genetic optimization to maximize n-bar returns when combining intermarket divergences based on different moving average lengths.
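Continuing the sketch above (reusing the hypothetical intermarket_divergence function, with illustrative lengths), the composite could be as simple as a sum:

```python
def composite_divergence(sp500, tbonds, lengths=(10, 26, 50)):
    """Sum three intermarket-divergence components with well-separated
    moving-average lengths; the result ranges from -3 to +3."""
    return sum(intermarket_divergence(sp500, tbonds, n) for n in lengths)
```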
In addition, we need to deal with intermarket correlation, predictive correlation, trend and strength of trend because intermarket relationships work differently based on these elements. Also, we should use several different measures of trend and use a different one in each knowledge block. The same is true for how we look at correlation. We then can combine four to 10 of these components per model and train them. Our collaboration block then will create one output. One or more of these collaboration blocks can be combined into a meta-goal, our final system.
This process represents more than just a new trading technique. It’s a new way of thinking about system development. As such, it doesn’t exploit new technologies to hammer different solutions out of old ideas. It leverages the potential of those technologies by forging brand new ways of building those solutions. In future articles, we will increasingly rely on these new methods to demonstrate the positive effect of technology on trading.
Murray A. Ruggiero Jr. is a consultant. His firm, Ruggiero Associates, develops
market timing systems. He is the author of "Cybernetic Trading Strategies" (Wiley). E-mail him at firstname.lastname@example.org.