Neural networks are fickle tools.

You will never get exactly the same result from two different initialization and training runs. This is one reason artificial intelligence (a.i.) trading has stagnated since the mid-to-late 1990s. Most end users cannot accept the peculiarities of popular tools, and demand for mainstream innovation has stalled.

But innovation hasn’t stopped on all fronts. Many ideas have advanced trading a.i. into new frontiers. One is kernel regression, a supervised modeling method like backpropagation, but one that does not start from random initial conditions. This means you don’t need to retrain a model on the same input/output data 10 times to work around the initial-condition problem.

Kernel regression does have its own issues. Typically, it loses robustness quickly as inputs are added. A good rule of thumb is to stay below 15 inputs, and preferably below 10; otherwise, expect problems. With smart design and domain expertise, however, this limited number of inputs is not a problem.

One well-known kernel regression technique is the support vector machine (SVM). This type of algorithm constructs a hyperplane in an n-dimensional space that separates the data into different classifications. SVM models are closely related to neural network models; in fact, an SVM with a sigmoid kernel function is equivalent to a two-layer feed-forward neural network.
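That equivalence is easy to see in code. The sketch below (Python, with illustrative constants kappa and c that are not from the article) evaluates a sigmoid kernel, which is exactly the weighted-sum-plus-tanh computation a hidden unit of a two-layer feed-forward network performs:

```python
import math

def sigmoid_kernel(x, z, kappa=0.01, c=-1.0):
    """Sigmoid (hyperbolic tangent) kernel: K(x, z) = tanh(kappa * <x, z> + c).

    This is the same computation as one hidden unit of a two-layer
    feed-forward network: a weighted sum passed through a tanh activation.
    kappa and c are illustrative defaults, not recommended settings.
    """
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return math.tanh(kappa * dot + c)
```

Like a tanh activation, the kernel value always falls between -1 and 1.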

Using a kernel function, SVMs are an alternative training method for polynomial, radial basis function and multi-layer perceptron classifiers, in which the weights of the network are found by solving a quadratic programming problem with linear constraints rather than the non-convex, unconstrained minimization problem of standard neural network training.

SVMs are described as having attributes, and a transformed attribute that is used to define the hyperplane is called a feature. Choosing the most suitable representation is known as feature selection. A set of features that describes one case (that is, a row of predictor values) is called a vector. The goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors that are near the hyperplane are the support vectors.

A two-dimensional example helps to visualize the concept. Assume that our data have a target variable with two categories. Also, assume that there are two predictor variables with continuous values. If we plot the data points using the value of one predictor on the X-axis and the other on the Y-axis, we might end up with an image such as the one shown in “Cuts two ways” (right). One category of the target variable is represented by rectangles, while the other category is represented by ovals.

In this idealized example, the cases with one category are in the lower left corner, while the cases with the other category are in the upper right corner; the cases are completely separated. The SVM analysis attempts to find a one-dimensional hyperplane (a line) separating the cases based on their target categories. An infinite number of possible lines to separate the categories exists; there are two candidate lines shown in the example. We need to determine which line is a better divider.

The parallel dashed lines mark the distance between the dividing line and the closest vectors to the line. The distance between the dashed lines is called the margin. The vectors (points) that constrain the width of the margin are the support vectors. An SVM analysis finds the line that is oriented to maximize the margin between the support vectors. In the figure, the line in the right panel is superior to the line in the left panel.
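The geometry above is straightforward to compute. In this minimal Python sketch (the line is written as w·x + b = 0; the function names are illustrative), a point is classified by the side of the line it falls on, and the margin is the distance from the line to the closest point:

```python
import math

def side(w, b, point):
    """Return +1 or -1 depending on which side of the line
    w[0]*x + w[1]*y + b = 0 the point falls (the two categories)."""
    value = w[0] * point[0] + w[1] * point[1] + b
    return 1 if value >= 0 else -1

def distance_to_line(w, b, point):
    """Perpendicular distance from a point to the line w.x + b = 0."""
    value = w[0] * point[0] + w[1] * point[1] + b
    return abs(value) / math.hypot(w[0], w[1])

def margin(w, b, points):
    """The margin is set by the support vectors: the points closest to
    the dividing line. SVM training picks (w, b) to maximize this."""
    return min(distance_to_line(w, b, p) for p in points)
```

For the line x + y - 2 = 0 and the points (0, 0) and (3, 3), the points fall on opposite sides and the margin is the distance to (0, 0), which is therefore a support vector.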

If all analyses consisted of two-category target variables with two predictor variables and clusters of points that could be divided by a straight line, life would be easy. Unfortunately, this is not generally the case.

WHEN LINES GO CROOKED

What if your data points cannot be separated with a straight line? In other words, they are nonlinear. Rather than fitting nonlinear curves to the data, SVM separates the data by using a kernel function to map them into a different space where a hyperplane can do the separation (see “Different view,” page 40).

The concept of a kernel mapping function is powerful. It allows the SVM models to perform separations with very complex boundaries.
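A small example shows why. For a degree-2 polynomial kernel, the mapping can even be written out explicitly; the sketch below (a standard textbook construction, not from the article) confirms that the kernel value equals an ordinary dot product in the mapped space, so the separation can happen there without ever computing the mapping:

```python
import math

def phi(x):
    """Explicit degree-2 feature map for 2-D input:
    phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (<x, z>)^2.
    Equals <phi(x), phi(z)> without mapping the data at all."""
    dot = x[0] * z[0] + x[1] * z[1]
    return dot * dot

x, z = (1.0, 2.0), (3.0, 0.5)
mapped = sum(a * b for a, b in zip(phi(x), phi(z)))
# poly_kernel(x, z) and mapped agree: this shortcut is the "kernel trick"
```

The kernel works on the original two-dimensional points, yet behaves as if the data had been lifted into the three-dimensional mapped space.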

Many kernel mapping functions can be used — probably an infinite number. However, few kernel functions have been found to work well in a wide variety of applications. The default and recommended kernel function is the radial basis function (RBF).
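The RBF kernel has the standard form K(x, z) = exp(-γ‖x - z‖²). A minimal sketch (the default gamma here is arbitrary, for illustration only):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: K(x, z) = exp(-gamma * ||x - z||^2).

    gamma controls how quickly similarity decays with distance. The other
    RBF-SVM parameter, C, penalizes misclassified training points during
    optimization and does not appear in the kernel itself.
    """
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

The kernel equals 1 when the two points coincide and falls toward 0 as they move apart, so it acts as a smooth similarity measure.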

There are two parameters for an RBF kernel: C and γ (gamma). It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify a good (C, γ) pair so that the classifier can accurately predict unknown data, that is, data not used in training. A common approach is to separate the training data into two parts, one of which is treated as unknown when training the classifier. Prediction accuracy on this held-out set more precisely reflects performance on classifying truly unknown data. An improved version of this procedure is known as cross-validation.

In v-fold cross-validation, the training set is divided into v subsets of equal size. Sequentially, each subset is tested using the classifier trained on the remaining v-1 subsets. Every instance of the whole training set is predicted exactly once, and the cross-validation accuracy is computed from those predictions; for a continuous target, the analogous measure is the mean squared error (MSE). The cross-validation procedure can prevent the over-fitting problem discussed earlier.
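The fold rotation can be sketched as follows (a plain-Python index splitter; the classifier training and testing themselves are left out):

```python
def v_fold_splits(n_cases, v):
    """Split indices 0..n_cases-1 into v near-equal folds and yield
    (train_indices, test_indices) pairs, one per fold, so that every
    case is held out exactly once."""
    indices = list(range(n_cases))
    fold_size, remainder = divmod(n_cases, v)
    folds, start = [], 0
    for i in range(v):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    for i in range(v):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test
```

With 10 cases and v = 5, each pass trains on 8 cases and tests on the 2 held out, and across the 5 passes every case is tested exactly once.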

A grid search determines C and γ using cross-validation: pairs of (C, γ) are tried, and the pair with the best cross-validation accuracy is picked. Trying exponentially growing sequences of C and γ has been found to be a practical method for identifying good parameters.