From the December/January 2013 issue of Futures Magazine

Neural networks: The dream that won’t die

Neural networks, if used properly, can provide the framework for a plethora of market analysis tools that can supplement an existing trading program or suggest new directions for future research. While the history of these tools dates back much further, their modern application took root in the late 1980s and came of age in 1993 when patent no. 5241620 was awarded to this author for the concept of embedding a neural network into a common spreadsheet. Suddenly, neural networks were not just part of the professional mainstream; they were also accessible to the average trader.

The analytical foundation for this leap is built on an algorithm called back propagation. In layman’s terms, this is a method that allows a network to learn to discriminate between classes that can’t be distinguished based on linear properties. Rumelhart, Hinton and Williams presented a well-received paper on what they called “Backward propagation of errors” in 1985. Others who did research into this approach include David Parker and Paul Werbos. Werbos arguably invented these techniques and presented them in “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences,” his 1974 Ph.D. dissertation at Harvard.

The back propagation algorithm is typically applied to a multi-layer perceptron that uses non-linear activation functions (see “Simple net,” below). The most commonly used functions are the sigmoid, which ranges from 0 to 1, and the hyperbolic tangent function, which ranges from -1 to 1. All inputs and target outputs must be mapped into these ranges when used in these types of networks.
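The mapping step can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple linear (min-max) rescaling; the function names and the sample prices are made up for the example and are not taken from any particular package.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid; its output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def scale_to_range(values, lo=-1.0, hi=1.0):
    """Linearly map values so the smallest becomes lo and the largest hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant series carries no information; park it at the midpoint.
        return [(lo + hi) / 2.0 for _ in values]
    span = v_max - v_min
    return [lo + (hi - lo) * (v - v_min) / span for v in values]

closes = [101.0, 103.5, 99.0, 104.0, 102.0]  # made-up prices

scaled_for_tanh = scale_to_range(closes, -1.0, 1.0)     # for tanh networks
scaled_for_sigmoid = scale_to_range(closes, 0.0, 1.0)   # for sigmoid networks

print(scaled_for_tanh)
print(scaled_for_sigmoid)
print(sigmoid(0.0), math.tanh(0.0))
```

In practice the minimum and maximum would be taken from the training set only and then reused on new data, so that the scaling does not peek into the future.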

The “magic” of back propagation, or backprop, is that mathematical calculations (the type typically found in first-year calculus) adjust the weights of the connections to minimize the error across the training set. An important attribute of these methods is that they generate a reasonably low error across the training set of inputs. However, they are not guaranteed to find the absolute minimum error; they typically settle into a local minimum. This means that training a neural network is not exact and depends on the precise data set and on the starting weights. Repeating the same experiment does not always give the same answer.
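The local-minimum problem is easy to see on a toy example. The sketch below is not a neural network: it assumes a single "weight" and a made-up error curve with two valleys, and it applies the same calculus-based step that backprop uses. Where the search ends depends entirely on where it starts.

```python
def error(w: float) -> float:
    """Made-up error curve with two valleys (illustrative only)."""
    return w ** 4 - 3.0 * w ** 2 + w

def error_slope(w: float) -> float:
    """First derivative of the error curve with respect to w."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0

def gradient_descent(start: float, learning_rate: float = 0.01,
                     steps: int = 1000) -> float:
    """Repeatedly step downhill along the slope."""
    w = start
    for _ in range(steps):
        w -= learning_rate * error_slope(w)
    return w

w_left = gradient_descent(start=-2.0)   # settles in the deeper valley
w_right = gradient_descent(start=2.0)   # settles in the shallower valley

print(w_left, error(w_left))
print(w_right, error(w_right))
```

Both runs stop where the slope is zero, but only the first one finds the lower error. A real network has thousands of weights and many more valleys, which is why two training runs rarely agree exactly.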

Backprop, in its original form, had a lot of issues. Many variations of this algorithm attempt to resolve those weaknesses. Early ideas used momentum and variable learning rate adjustment techniques, such as simulated annealing. When newer tactics are combined with older ones, the combination can optimize learning. For example, we perform batch learning in parallel so that we can run it on multiple cores, saving a tremendous amount of time. All of these variations are supervised learning algorithms: We give them input patterns and train them to output a certain target set of results. In doing so, we map the patterns, which in turn allows us to generalize for new patterns that were not used in training.
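To make the ideas of supervised batch learning and momentum concrete, here is a minimal sketch in plain Python. It trains a tiny tanh network on the XOR pattern, a standard example of classes that cannot be separated linearly. The network size, learning rate, momentum value and the 0.9 target levels are all assumptions chosen for illustration; this is not the parallel implementation described above.

```python
import math
import random

def train_xor(seed=0, hidden=4, epochs=2000, lr=0.05, momentum=0.5):
    """Batch backprop with momentum; returns mean squared error per epoch."""
    rng = random.Random(seed)
    # Inputs and targets are kept inside the tanh range of -1 to 1.
    data = [((-1.0, -1.0), -0.9), ((-1.0, 1.0), 0.9),
            ((1.0, -1.0), 0.9), ((1.0, 1.0), -0.9)]
    # Each hidden unit has two input weights plus a bias weight.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    v_h = [[0.0] * 3 for _ in range(hidden)]   # momentum terms
    v_o = [0.0] * (hidden + 1)
    history = []
    for _ in range(epochs):
        g_h = [[0.0] * 3 for _ in range(hidden)]
        g_o = [0.0] * (hidden + 1)
        sse = 0.0
        for (x1, x2), target in data:
            inputs = (x1, x2, 1.0)
            # Forward pass.
            h = [math.tanh(sum(w * x for w, x in zip(w_h[j], inputs)))
                 for j in range(hidden)]
            h_b = h + [1.0]
            out = math.tanh(sum(w * a for w, a in zip(w_o, h_b)))
            err = out - target
            sse += err * err
            # Backward pass: chain rule through the tanh units.
            d_out = err * (1.0 - out * out)
            for j in range(hidden + 1):
                g_o[j] += d_out * h_b[j]
            for j in range(hidden):
                d_h = d_out * w_o[j] * (1.0 - h[j] * h[j])
                for i in range(3):
                    g_h[j][i] += d_h * inputs[i]
        history.append(sse / len(data))
        # Batch update: weights change once per pass over the training set.
        for j in range(hidden + 1):
            v_o[j] = momentum * v_o[j] - lr * g_o[j]
            w_o[j] += v_o[j]
        for j in range(hidden):
            for i in range(3):
                v_h[j][i] = momentum * v_h[j][i] - lr * g_h[j][i]
                w_h[j][i] += v_h[j][i]
    return history

errors = train_xor(seed=0)
print("first epoch error:", errors[0])
print("last epoch error:", errors[-1])
```

Changing the seed changes the starting weights, and the error path will generally differ from run to run. Because the gradients for each pattern are simply summed before the update, the inner loop is the part that could be split across cores.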

There are other algorithms, such as radial nets and kernel regression (a family of methods related to Support Vector Machines). All of these algorithms can be used to create approximations of non-linear functions, much as a neural network maps a given input to an output. Put simply, we create a universal function “approximator” that, given a set of inputs, can provide a good idea of what the optimal solution to a problem would be.

Client driven

As with most things, interest in neural networks took off when customers started demanding them. Traders, hungry for the next big thing, were clamoring for the technology in the early 1990s. However, the vast majority of these traders had no background in the processes, and those who had the background knew nothing about the markets or how they work.

But neural networks were not the perfect solution, and after many years of trial and error, it became clear why: Standard neural-network-based signal processing techniques simply do not work in the markets as signal generators. In other words, the process of implementing neural networks correctly must begin far earlier in trading system development. You must create systems that work well without neural networks for neural networks to be able to improve them. 

However, this knowledge was not common in the early 1990s. Many large institutions brought in neural network expertise that used methods of building models that did not incorporate domain expertise (market knowledge). Errors of this type were prevalent even among large banks with huge budgets.

In one specific case, a large bank had an in-house team that developed a trading model using deterministic models, such as the Mackey Glass equation, to simulate data. Mackey Glass is a time-delay differential equation that can generate a curve that looks like a stock market price series. The equation is used most often to model biological processes, such as white blood cell circulation. Expertise in these methods was used as a substitute for market knowledge. The models failed completely.

Another approach that didn’t work was price change forecasting. Even worse, many used these forecasts as the core of a trading approach. The failure is obvious with hindsight. When this approach was good, it was absolutely amazing. When it was bad, it was horrible. These methods were, in effect, gambling, and the gamblers eventually lost.

That’s when many stepped back and started to use these new tools to solve classic trading problems instead of treating them like the Holy Grail of price forecasting. For example, one classic problem with all traditional technical indicators is lag. However, by using a neural network, we could minimize lag and improve the performance of the indicators. One viable approach was using a neural network to predict a moving average crossover two to three bars early in Treasury bond futures. It worked. Similar work was done with moving average convergence-divergence and the ADX.
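One way to frame the crossover problem for a supervised network is to build the moving average difference and then use its value a few bars ahead as the training target. The sketch below assumes simple moving averages, short illustrative lengths and a three-bar lead; the original work may have defined its targets differently.

```python
def sma(values, length):
    """Simple moving average; None until enough bars are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < length:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - length:i + 1]) / length)
    return out

def crossover_oscillator(closes, fast=3, slow=5):
    """Fast average minus slow average; a sign change marks a crossover."""
    f, s = sma(closes, fast), sma(closes, slow)
    return [None if a is None or b is None else a - b for a, b in zip(f, s)]

def lead_target(series, bars_ahead=3):
    """Training target for bar t is the series value at t + bars_ahead."""
    n = len(series)
    return [series[t + bars_ahead] if t + bars_ahead < n else None
            for t in range(n)]

closes = [float(i) for i in range(1, 11)]   # made-up prices
osc = crossover_oscillator(closes)
target = lead_target(osc, bars_ahead=3)

print(osc)
print(target)
```

The last few bars have no target because the future value is not yet known; those rows would be dropped from the training set.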

Regardless of the ultimate application, all successful implementations of this new philosophy had this in common: They were designed to avoid the worst-case failure rather than go for the home run.

An example of this is using George Taylor’s so-called “book method” to trade the S&P 500. It was covered in Futures in the late 1990s (see, for example, “Born again neural nets,” February 1999). The key to this trading method is the entry trigger: Buying on a limit at the previous day’s low and selling on a limit at the previous day’s high. Two neural networks were used: One to predict the difference between today’s low and yesterday’s low, and one to predict the difference between today’s high and yesterday’s high. Safety measures were integral to the design, so if the neural network failed and the output became random noise, results would simply revert to those of the original system. Any variation from the core, non-optimized strategy would be all on the upside.
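The entry rule and the two network targets described above translate directly into code. The sketch below assumes each bar is a dictionary with "high" and "low" keys, which is a representation chosen for this example. The safety measures are not specified in enough detail here to reproduce, so they are left out.

```python
def limit_prices(prev_bar):
    """Base system entries: limit orders at the previous day's low and high."""
    return {"buy_limit": prev_bar["low"], "sell_limit": prev_bar["high"]}

def network_targets(bars):
    """Targets for the two networks, one row per day after the first."""
    rows = []
    for prev, cur in zip(bars, bars[1:]):
        rows.append({
            "low_diff": cur["low"] - prev["low"],      # today's low vs. yesterday's
            "high_diff": cur["high"] - prev["high"],   # today's high vs. yesterday's
        })
    return rows

bars = [  # made-up daily bars
    {"high": 105.0, "low": 100.0},
    {"high": 107.0, "low": 101.5},
    {"high": 106.0, "low": 99.5},
]

print(limit_prices(bars[0]))
print(network_targets(bars))
```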

Back to the future

We can use these high-level techniques to enhance components of an already profitable strategy. This approach requires domain expertise in classic rule-based market strategies as well as expertise in neural networks, signal processing and cycle processing. This combination is the only way to use these technologies successfully.

Unfortunately, frustration born from an inability to accept this reduced role for neural networks pushed many analysts away from the method. As analysts, we simply had been trying to make neural networks do things they couldn’t do: Predict future price changes.

Now, two key developments have made neural networks viable again. First was the development of sound, robust trading strategies that these tools could enhance. Second was technology that reduced the expense and time of implementing these resource-intensive techniques. Advances in computing power and software tools such as .NET 4.5 have made it easier to develop multi-core, multi-server implementations.

These two developments couldn’t have come to pass soon enough. Put simply, the markets have become noisier and harder to trade. Realistically, these advanced technologies can improve performance by 20% to 40%, but until recently implementing them would have taken two to three times longer than developing the original system. It now makes sense to invest in these technologies to enhance trading strategies.

Application

The first step to building a trading system that uses these powerful advancements is to develop a robust rule-based model for trading. Next, we identify parts of the model that can be improved using neural networks. Examples of viable applications are predicting a certain feature of the model a few bars in the future or pattern identification. Then, we must test to see how robust this feature is. That is, if we are predicting something three bars into the future, we need to examine the worst-case scenario if our “prediction” is instead three bars late. We only want to predict elements which, if we fail, won’t cause the entire trading system to blow up.
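A simple way to run this worst-case check is to delay the signal by a few bars and compare the results. The sketch below assumes a position series of +1 (long), -1 (short) and 0 (flat), one entry per bar, and a made-up price series; the function names are illustrative.

```python
def shift_late(signal, bars):
    """Delay a signal by `bars`; the system stays flat until it arrives."""
    if bars <= 0:
        return list(signal)
    return ([0] * bars + list(signal))[:len(signal)]

def pnl(closes, positions):
    """Profit from holding positions[t] over the move from bar t to t + 1."""
    return sum(p * (b - a) for p, a, b in zip(positions, closes, closes[1:]))

closes = [10, 11, 12, 13, 14, 13, 12, 11, 10]   # made-up prices
on_time = [1, 1, 1, 1, -1, -1, -1, -1, -1]      # catches the turn exactly
late = shift_late(on_time, 3)                   # same signal, three bars late

print("on time:", pnl(closes, on_time))
print("three bars late:", pnl(closes, late))
```

If the late version loses more than the system can tolerate, that element is a poor candidate for prediction.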

Two important steps are defining exactly what we are going to predict and developing our pre-processing methodology. The best type of pre-processing is custom pre-processing developed using advanced statistical analysis and data mining. This type of analysis is time-consuming and is one of the most expensive parts of building these models. A reasonable, simpler pre-processing alternative was developed by Mark Jurik and is known as “Level-0” and “Level-1” (see “On the level,” below).

Other limitations are algorithm-based rather than mathematical. For example, kernel regression degrades when you have more than 12 inputs. This is not that big of a problem because methods such as principal component analysis (PCA) can be used to pre-process inputs to reduce dimensionality. PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. These values then are called “principal components.” The number of principal components is less than or equal to the number of original variables.
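For readers who want to see the mechanics, the sketch below extracts only the first principal component using power iteration on the covariance matrix, in plain Python. This is a teaching simplification, not how a production library computes PCA, and the starting vector is an arbitrary assumption that could fail if it happened to be exactly perpendicular to the answer.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, share of variance explained, projected scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the inputs.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeated multiplication pulls the vector toward
    # the direction of greatest variance.
    v = [1.0 / (j + 1) for j in range(d)]   # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    variance_along_v = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    explained = variance_along_v / total_variance if total_variance else 0.0
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, explained, scores

# Two made-up inputs that move together perfectly: one column suffices.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, explained, scores = first_principal_component(rows)

print(direction)
print(explained)
print(scores)
```

When the first few components explain most of the variance, their scores can replace the original, larger set of inputs.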

Other issues deal with how the inputs are normalized and their distribution. For example, in the case of neural networks, when using the tanh() function, we want to make sure that our data distribution is not concentrated at the extreme ends of the set. If so, the neural network can’t learn effectively. The same is true for various kernels used in kernel regression functions.
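One common explanation is that tanh is nearly flat at its extremes, so the slope that backprop relies on almost vanishes there. The sketch below shows the slope at a few points and a simple check of how much of a scaled input sits near the ends. The 0.9 threshold and the sample values are assumptions for illustration.

```python
import math

def tanh_slope(x: float) -> float:
    """Derivative of tanh(x), which is 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

def saturated_fraction(scaled_values, threshold=0.9):
    """Share of already-scaled inputs sitting near the -1 or +1 extremes."""
    hits = sum(1 for v in scaled_values if abs(v) > threshold)
    return hits / len(scaled_values)

for x in (0.0, 1.0, 3.0):
    print(x, tanh_slope(x))

scaled_inputs = [-1.0, -0.95, 0.0, 0.2, 0.99]   # made-up scaled inputs
print(saturated_fraction(scaled_inputs))
```

A high fraction suggests the scaling is being dominated by a few outliers, and a different transformation may spread the data more evenly.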

Pre-processing design is critical because the models are black boxes and statistical artifacts could control them easily. Statistical artifacts are apparent cause-and-effect relationships in the data that are erroneous because no valid cause and effect lies behind them. An example can be seen in Nasdaq data over the 1998-2002 period. You may have observed an exceptionally high correlation between the price change from today’s close to the price two days later and today’s price minus the price four days ago. However, a model based on this relationship would have failed within a few months following the test period because the market started moving sideways; the relationship was only present during parabolic up and down environments.

Error distribution is another factor to consider. Two different models with the same root mean square error could have opposite profit profiles depending on how the errors are distributed. Predicting turning points well makes most of the money for these models. A network can achieve a low statistical error by keeping its outputs around zero, or by being right on small moves and wrong on large ones, and still miss the turning points. A network that does well on large moves but is wrong on smaller moves could make more money even though it has a bigger statistical error.
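A small made-up example shows how two models can tie on root mean square error and still sit on opposite sides of the ledger. The numbers below are invented for illustration, and the profit rule (trade one unit in the direction of the forecast) is an assumption.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between forecasts and actual moves."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(actual))

def directional_profit(predicted, actual):
    """Trade one unit in the forecast direction and collect the actual move."""
    def sign(x):
        return (x > 0) - (x < 0)
    return sum(sign(p) * a for p, a in zip(predicted, actual))

actual = [4.0, -4.0, 1.0, -1.0]      # two large moves, two small moves

# Model A overshoots the large moves but gets their direction right.
model_a = [8.5, -8.5, 1.0, -1.0]
# Model B hugs zero and lands on the wrong side of the large moves.
model_b = [-0.5, 0.5, 1.0, -1.0]

print("RMSE A:", rmse(model_a, actual), "profit A:", directional_profit(model_a, actual))
print("RMSE B:", rmse(model_b, actual), "profit B:", directional_profit(model_b, actual))
```

Both models miss each large move by 4.5 points, so the error statistic cannot tell them apart; only the sign of the forecast on the big moves separates the winner from the loser.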

In this article, we’ve traced the modern history of neural network application in the markets and provided an overview of currently accepted applications of the technologies. The next step is to build on this overview and lay out a real example applying neural networks as a component of a profitable trading system. Finally, we will discuss how new technological advances could bring the full promise of neural nets back into focus.

On the level

Mark Jurik’s data pre-processing method assumes the price series is formed by cycles of different frequencies. If data were sampled at different frequencies, the samples would carry all the information in the series.

To do this, we use sample blocks of data. If a block is further in the past, it is spaced further from the next block and is larger. The index determines how far back in time the center of the block is situated. This index is chosen such that it covers the period between the consecutive blocks. The indexes are provided as shown:

Row 1 (n) = 1 2 3 4 5 7 9 13 17 25 33 49 65 97 129 193 257 385
Row 2 (m) = 0 0 0 0 0 2 2 4 4 8 8 16 16 32 32 64 64 128

This strategy provides the neural network with the information it needs to look back in time without sampling every bar. For example, if we believe the price of gold affects the 10-year Treasury note for up to 50 bars, we would use the sample for row “n” and there would not be 50 columns of inputs. We would sample the first five days, and then our samples would become further apart because the further separated samples are trying to pick up longer-term cycles. If we are trying to find a 30-day cycle, it can be reproduced sampling every five days without needing to sample every day.
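The sampling idea can be sketched as follows. The exact meaning of row m is not spelled out here, so this sketch assumes that m is the width of a block centered n bars back (covering n - m/2 through n + m/2) and that each block is summarized by its simple average. Both are assumptions made for illustration and should not be read as Jurik's published formulas.

```python
N_ROW = [1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193, 257, 385]
M_ROW = [0, 0, 0, 0, 0, 2, 2, 4, 4, 8, 8, 16, 16, 32, 32, 64, 64, 128]

def block_samples(series, max_lookback=50):
    """Average each block; series[-1] is taken to be the current bar."""
    samples = []
    for n, m in zip(N_ROW, M_ROW):
        if n > max_lookback:
            break
        half = m // 2
        start, end = n - half, n + half     # bars back, inclusive
        if end >= len(series):
            break                           # not enough history for this block
        block = [series[-1 - k] for k in range(start, end + 1)]
        samples.append(sum(block) / len(block))
    return samples

gold = [float(i) for i in range(100)]       # made-up price history
inputs = block_samples(gold, max_lookback=50)

print(len(inputs))
print(inputs)
```

With a 50-bar horizon, only the table entries up to n = 49 are used, so the network sees 12 input columns for this series instead of 50.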

Level-0 features are the normalization of price and the exponential moving average of price. These are sampled using row n in this table. Level-1 features are normalized price change relative to a block moving average.

Level 0 feature formula:

Level 1 feature formula:

This canned pre-processing works well for predicting moving average oscillators when you also include sampled past values of the output target.

Developing custom pre-processing requires deep domain expertise as well as data mining skills. A reasonable test is to plot the pre-processed inputs against our target in scatter charts. We look for these scatter charts to show patterns, either linear or non-linear shapes. What we don’t want to see is a shapeless blob.

We also will use approximation paradigms such as rough sets and machine learning algorithms such as C4.5 that can judge the information content of the pre-processing we develop.
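As a rough illustration of what "information content" means in this setting, the sketch below computes information gain, the entropy-based measure that decision tree learners build on. C4.5 itself goes a step further and normalizes this into a gain ratio, which is omitted here. The binned feature values and the up/down labels are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_bins, labels):
    """Drop in target entropy once we know which bin the feature falls in."""
    n = len(labels)
    groups = {}
    for b, y in zip(feature_bins, labels):
        groups.setdefault(b, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

target = ["up", "up", "down", "down"]        # made-up outcomes
informative = ["hi", "hi", "lo", "lo"]       # lines up with the target
useless = ["hi", "lo", "hi", "lo"]           # tells us nothing

print(information_gain(informative, target))
print(information_gain(useless, target))
```

A pre-processed input with a gain near zero is a candidate for removal before any network is trained.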

Murray A. Ruggiero Jr. is the author of “Cybernetic Trading Strategies” (Wiley). E-mail him at ruggieroassoc@aol.com.
