The 1950s through the 1960s were the first golden age of neural networks, and the field saw a great deal of research. During the early 1960s, Bernard Widrow used two-layer perceptrons to solve many real-world problems, such as short-range weather forecasting.

In 1969, Minsky and Papert’s book “Perceptrons” showed that these simple models could not solve problems that are not linearly separable, a class that occurs frequently in the real world. The simplest example of this class is called the “exclusive ‘or’” (XOR) problem. For example, “I can go to the movies or the mall, but I can’t do both.”

The truth table for such a statement reads as follows:

| Mall | Movies | Outcome |
|------|--------|---------|
| 0    | 0      | 0       |
| 1    | 1      | 0       |
| 0    | 1      | 1       |
| 1    | 0      | 1       |

No matter how many layers a neural network of the time had, there was no way to train it to solve this problem; the applied mathematics for minimizing the error through hidden layers did not mature until the mid-1980s. Paul Werbos developed a method for training a three-layer network that could solve the exclusive-“or” problem, and David Rumelhart and Geoffrey Hinton created a generalized form of this algorithm, called back propagation, for multiple hidden layers. That’s when the neural network boom started.

This technique is a way of training a neural network, and it is based on a simple idea. We input a vector of values into the network and get some sort of output, which is likely to be off by some amount. We determine the cumulative error among the different output units, which are connected to the hidden units within the network.

We don’t know what the hidden units are supposed to do, but we can compute how fast the error changes as we change a hidden unit’s activity. We then use error derivatives with respect to the hidden activities. Each hidden unit can affect many output units and can, therefore, have many separate effects on the error; these effects are combined at the output.

The back propagation algorithm goes through the hidden elements, computes the error derivatives and then adjusts the weight of each node to search for a minimum value of error across the training set.
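The loop just described can be sketched in a few lines of Python. The 2-4-1 architecture, sigmoid activations, learning rate, and epoch count below are illustrative assumptions, not a prescribed setup:

```python
import numpy as np

# Back propagation on the exclusive-"or" problem with a small 2-4-1 network.
# Architecture, activation, and learning rate here are illustrative choices.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([[0], [0], [1], [1]], dtype=float)  # the XOR truth table

W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))  # input -> hidden
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))  # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0  # step-size parameter
for epoch in range(10000):
    # Forward pass: inputs -> hidden activities -> outputs.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: error derivatives at the outputs, then at the hidden units.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)  # each hidden unit's combined effect
    # Adjust each weight against its error derivative.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out.ravel(), 2))  # predictions for the rows 00, 11, 01, 10
```

With enough epochs, the network’s outputs settle close to the truth table’s outcome column.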

**Variations**

There are many versions of this algorithm, but its core concept remains. Indeed, most machine learning courses at major universities still teach it in this form. It is still used to develop models today, and modern variants can help avoid local minima. The most popular adjustment is adding a momentum term.

Momentum simply adds a fraction of the previous weight update to the current one. When the gradient keeps pointing in the same direction, this increases the size of the steps taken toward the minimum. It is therefore usually necessary to reduce the global learning rate when using a lot of momentum (when the momentum coefficient µ is close to 1).

The learning rate affects the speed at which the neural network arrives at the minimum solution. It is a common parameter in many of the learning algorithms. In back propagation, the learning rate is the step-size parameter. If the step-size parameter is too high, the system will often oscillate about the true solution because each adjustment is too big for it to arrive at the right answer. If the step-size is too low, you will eventually hit the right solution, but it will take a long time.

If you combine a high learning rate with a lot of momentum, you will breeze right past the minimum with huge steps and never truly converge on a solution.
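These failure modes can be seen on a simple one-dimensional error surface, E(w) = w². The function name and parameter values below are illustrative assumptions:

```python
# Gradient descent with momentum on E(w) = w^2, whose gradient is 2w.
# lr is the learning rate (step size); mu is the momentum coefficient.
def minimize(lr, mu, steps=100, w=5.0):
    v = 0.0
    for _ in range(steps):
        grad = 2.0 * w
        v = mu * v - lr * grad  # add a fraction of the previous update
        w = w + v
    return w

print(minimize(lr=0.1, mu=0.5))            # modest settings: converges near 0
print(minimize(lr=2.5, mu=0.9, steps=50))  # huge steps + heavy momentum: diverges
```

The first call settles near the minimum at zero; the second overshoots on every step and never converges.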

Research into predicting the markets using back propagation is still going on. The main difference is that today’s network configurations have more hidden nodes and the models are trained for more epochs. These models may need to be trained for 10,000 epochs, and the number of hidden neurons is usually twice the number of inputs. In the 1990s, this was almost impossible with the computing power available. Now that modern computers have multicore processors, tackling a problem such as this is not a big deal.

**Going both ways**

A standard feed-forward neural network computes its output values in a sequential manner. You have inputs, then you do some processing, then you have a set of outputs. A recurrent neural network has processing nodes that send output values forward and backward. This unique type of neural network can solve problems that regular network types cannot.

Several types of these networks exist. A simple recurrent neural network is the Elman network, named after Jeffrey Elman, the researcher who published a paper explaining the idea in 1988.

Elman networks have connections from their hidden layer back to a special copy layer. Any function the network learns can therefore be based on multiple factors: the current inputs as well as a record of the network’s previous states and outputs. Put another way, the Elman network is a finite state machine that learns which state to remember. The special copy layer is treated just like another input, which allows standard back propagation training techniques to be used; in many other recurrent networks, this is not possible.
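One step of such a network might look like the following sketch; the layer sizes, weight scales, and function names are assumptions:

```python
import numpy as np

# One forward step of an Elman-style recurrent network.
# Layer sizes and weight initialization here are illustrative choices.
rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 3, 5, 2

W_in = rng.normal(0, 0.1, (n_in, n_hidden))       # inputs -> hidden
W_ctx = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context (copy) -> hidden
W_out = rng.normal(0, 0.1, (n_hidden, n_out))     # hidden -> outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(x, context):
    # The context layer is treated just like another input to the hidden layer.
    hidden = sigmoid(x @ W_in + context @ W_ctx)
    output = sigmoid(hidden @ W_out)
    # The hidden activations are copied into the context layer for the next step.
    return output, hidden

context = np.zeros(n_hidden)        # empty memory before the first step
sequence = rng.normal(0, 1, (4, n_in))
for x in sequence:
    output, context = elman_step(x, context)

print(output)
```

Because the copy is just another input, the usual feed-forward training machinery applies unchanged.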

“Elman breakdown” (below) shows a common structure of an Elman network. Note the context layer, which receives input from, and feeds back into, the hidden layer.

Jordan networks (named after researcher Michael I. Jordan) are similar in concept, but the context units are fed from the output layer instead of the hidden layer. In a Jordan network, this layer is referred to as the state layer, and its units have a recurrent connection to themselves with no other nodes on that connection.

**Non-traditional training**

There is a notable constraint on the algorithms discussed so far: Back propagation and similar classes of algorithms require an error function that is differentiable. Sometimes we want to train a model for which this is not the case.

One solution is genetic algorithms, which allow the use of a more appropriate error function. For example, when you train a back propagation network, you’re really training to root mean squared error (RMSE). This does not give you the best trading results: Often, the networks with the lowest RMSE don’t map prices to the best distribution for profits. Using genetic algorithms, we can use output from a backtest as part of our training criteria.

Big data solutions to this problem use associative memories. This is one area in which neural network research has advanced. A simple associative memory can be implemented with a back propagation network called an “autoencoder.” It is trained to reproduce its input as output using fewer hidden nodes than inputs. In many respects, this functions as a data compression algorithm.

In “Three layers” (below), output[i] has edges back to input[i] for every “i.” Usually, the network is built so that the number of hidden units is far fewer than the number of input and output units. As a result, when you pass data through the network, it compresses the input to fit in a smaller representation; the output layer then tries to reconstruct the input from that smaller hidden layer. The whole purpose of training is to minimize the reconstruction error; as such, these networks are not aimed at finding generalizations but rather at finding efficient ways to compress (encode) data.
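A minimal sketch of this idea is the classic 4-2-4 encoder: four one-hot patterns squeezed through two hidden units and reconstructed at the output. The sizes, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

# A 4-2-4 autoencoder trained with back propagation: the target output is
# the input itself, forced through a 2-unit bottleneck (illustrative sizes).
rng = np.random.default_rng(2)
X = np.eye(4)                                     # four one-hot input patterns

W1 = rng.normal(size=(4, 2)); b1 = np.zeros(2)    # 4 inputs -> 2 hidden units
W2 = rng.normal(size=(2, 4)); b2 = np.zeros(4)    # 2 hidden -> 4 outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = sigmoid(X @ W1 + b1)       # compressed representation
    out = sigmoid(h @ W2 + b2)     # attempted reconstruction of the input
    return h, out

_, out0 = forward(X)
initial_error = float(np.mean((out0 - X) ** 2))

lr = 2.0
for _ in range(5000):
    h, out = forward(X)
    d_out = (out - X) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

_, out = forward(X)
final_error = float(np.mean((out - X) ** 2))
print(initial_error, final_error)  # reconstruction error drops with training
```

The two hidden activations end up carrying enough information to tell the four patterns apart, which is exactly the compression the text describes.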

The Boltzmann machine is a type of stochastic recurrent network, invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can represent and solve difficult problems dealing with finite or countable discrete structures. In theory, the Boltzmann machine would be a rather general computational medium. If we trained such a machine on photographs, it would be able to model the distribution of photographs, and we could then use that model to, for example, complete a partial photograph.

Unfortunately, there is a serious problem with the Boltzmann machine: It stops learning correctly when scaled up to anything larger than a trivial size. The time the machine must be run grows exponentially with its size, and connection strengths are more plastic when the units have activation probabilities intermediate between zero and one.

The restricted Boltzmann machine (RBM) algorithm was invented slightly after the original one but did not catch on until the mid-2000s as it trained very slowly. However, the increased computational power of today’s computers has made this algorithm practical.

An RBM consists of one layer of visible units and one layer of hidden units, with no connections between units in the same layer. Connections between units (neurons) are bidirectional and symmetric, meaning that information can flow both ways during the training and the usage of the network, and that the weights are the same in both directions (see “Making connections,” right).

A restricted Boltzmann machine uses a stochastic approach: Its units are stochastic, with activations drawn from a particular distribution. The learning procedure consists of several rounds of sampling and adjusting the weights to minimize reconstruction error.
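One such sampling-and-update round can be sketched with the common contrastive-divergence (CD-1) recipe; the layer sizes, learning rate, and variable names are assumptions:

```python
import numpy as np

# One CD-1 training step for a tiny RBM (illustrative sizes and learning rate).
rng = np.random.default_rng(3)
n_visible, n_hidden = 6, 3

W = rng.normal(0, 0.1, (n_visible, n_hidden))  # symmetric, bidirectional weights
a = np.zeros(n_visible)                        # visible biases
b = np.zeros(n_hidden)                         # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, lr=0.1):
    """Sample hidden units, reconstruct the visible layer, adjust the weights."""
    global W, a, b
    p_h0 = sigmoid(v0 @ W + b)                       # stochastic hidden units
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + a)                     # reconstruction (flow back)
    v1 = (rng.random(n_visible) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b)
    # Move toward the data statistics, away from the reconstruction's.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a += lr * (v0 - v1)
    b += lr * (p_h0 - p_h1)
    return float(np.mean((v0 - p_v1) ** 2))          # reconstruction error

v = np.array([1, 1, 0, 0, 1, 0], dtype=float)
for _ in range(200):
    err = cd1_step(v)
print(err)
```

After repeated rounds on the same pattern, the reconstruction probabilities drift toward the data, driving the monitored error down.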

The intuition behind this concept is that there are some visible random variables and some hidden variables. For example, the visible variables might be film reviews from different users. The hidden variables might be film genres or other features. The task of training then is to find out how these two sets of variables are connected. This is a common problem in big data.

**Particle swarm optimization**

Particle swarm optimization (PSO) is a biologically inspired computational search and optimization method. Through the years, variations of this algorithm have improved on the original’s speed and quality of convergence on a solution.

The original algorithm emulates the behavior of animal societies that have no clear leader, such as flocking birds or schooling fish. These groups find their best conditions through communication among their members: An animal that finds a better situation informs the flock, and the others move to the same place. This happens repeatedly until the best conditions or a food source is discovered.

PSO attempts to emulate that process by having each particle in a swarm represent a potential solution. Classic PSO is best for simple problems, but a modified version of it can solve complex problems. These variations have helped expose the internal workings of the algorithm.

PSO is meta-heuristic in that it makes few to no assumptions about the problem that we would like to optimize. Therefore, it can search very large spaces of candidate solutions. However, meta-heuristics such as PSO do not guarantee that an optimal solution will ever be found. The optimization problem need not be differentiable as required by more classic methods (like the back propagation method we discussed earlier). PSO can, therefore, be used on optimization problems that are partially irregular, noisy, change over time or have other inconsistent characteristics.

**Consider the following scenario:**

A group of birds is searching an area that contains only one piece of food. None of the birds knows where that food is. However, on each iteration, each bird knows how far away the food is. So what’s the best strategy to find the food? The most effective one is to follow the bird that is nearest to the food.

With PSO, each solution or particle is a “bird.” All of the particles have fitness values that are evaluated by the fitness function to be optimized and have different velocities that direct the flying of the particles. The particles fly through the problem space by following the current optimum particles.

The algorithm is initialized with a group of random particles (solutions) and then searches for optima by updating generations. In every iteration, each particle is updated by following two “best” values. The first is the best solution that particle has achieved so far (pbest). The second is the best value obtained so far by any particle in the population (gbest). When a particle takes only part of the population as its topological neighbors, that best value is instead a local best (lbest).

After finding the two best values, the particle updates its velocity and position with the following equations:

**Equation a) v[] = v[] + c1 * rand() * (pbest[] – present[]) + c2 * rand() * (gbest[] – present[])**

and

**Equation b) present[] = present[] + v[]**

In the above equations, v[] is the particle velocity and present[] is the current particle (solution). pbest[] is the particle’s best position, and gbest[] is the global best. rand() is a random number in (0, 1), and c1 and c2 are learning factors; in normal optimizations using this algorithm, both usually are set to 2.
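Equations a) and b) translate almost directly into code. The sphere fitness function, swarm settings, and velocity clamp (a common practical addition the equations leave implicit) below are illustrative assumptions:

```python
import random

def pso(fitness, dim=2, n_particles=20, iters=200, c1=2.0, c2=2.0, vmax=4.0):
    """Classic particle swarm optimization minimizing `fitness`."""
    present = [[random.uniform(-5, 5) for _ in range(dim)]
               for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in present]      # each particle's best position so far
    gbest = min(pbest, key=fitness)[:]   # best position found by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # Equation a): pull the velocity toward pbest and gbest.
                v[i][d] += (c1 * random.random() * (pbest[i][d] - present[i][d])
                            + c2 * random.random() * (gbest[d] - present[i][d]))
                v[i][d] = max(-vmax, min(vmax, v[i][d]))  # clamp the velocity
                # Equation b): move the particle.
                present[i][d] += v[i][d]
            if fitness(present[i]) < fitness(pbest[i]):
                pbest[i] = present[i][:]
                if fitness(pbest[i]) < fitness(gbest):
                    gbest = pbest[i][:]
    return gbest

# Example: minimize the sphere function (sum of squares), optimum at the origin.
random.seed(0)
sphere = lambda x: sum(xi * xi for xi in x)
best = pso(sphere)
print(best, sphere(best))
```

Every particle keeps its own memory (pbest) while being tugged toward the swarm’s discovery (gbest), which is exactly the flock-following intuition above.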

**Genetic algorithms and PSO**

An obvious question might be, “Is PSO better than using genetic algorithms?” To compare them, note that most evolutionary algorithms share a similar structure:

1. Randomly generate an initial population.
2. Compute a fitness value for each subject; this depends directly on its distance to the optimal solution.
3. Reproduce the population based on those fitness values.
4. If the requirements are met, we’re done; if not, return to step 2.
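The steps above can be sketched as a minimal genetic algorithm. The real-valued encoding, one-point crossover, mutation rate, and sphere fitness function are illustrative assumptions:

```python
import random

def genetic_algorithm(fitness, dim=2, pop_size=30, generations=100):
    """Minimal GA following the steps: generate, score, reproduce, repeat."""
    # Step 1: randomly generate an initial population.
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: a fitness value for each subject (lower = closer to optimal).
        pop.sort(key=fitness)
        # Step 3: reproduce from the better half via crossover and mutation.
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, dim) if dim > 1 else 0
            child = p1[:cut] + p2[cut:]          # one-point crossover
            if random.random() < 0.2:            # occasional mutation
                j = random.randrange(dim)
                child[j] += random.gauss(0, 0.5)
            children.append(child)
        pop = parents + children
        # Step 4: a stopping test on the best subject could go here.
    return min(pop, key=fitness)

random.seed(0)
sphere = lambda x: sum(xi * xi for xi in x)
best = genetic_algorithm(sphere)
print(best, sphere(best))
```

Note how information spreads here through crossover between chromosomes, in contrast to PSO, where only the best positions broadcast to the swarm.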

PSO shares many common points with genetic algorithms. Both algorithms start with randomly generated populations, and both have fitness values to evaluate the population. In addition, both update the population and search for an optimal solution using random techniques. Neither PSO nor genetic algorithms guarantee success.

However, PSO does not have genetic operators like crossover or mutation; particles update themselves with an internal velocity. One thing PSO does have is memory, which is important to the algorithm.

Compared with genetic algorithms, information is shared very differently in PSO. In genetic algorithms, chromosomes share information with each other, so the whole population moves like one group toward the optimal solution. In PSO, only gbest (or lbest) gives out information to the others. All particles thus tend to converge on the best solution quickly, in most cases even in the local version. PSO also is a great methodology for training neural networks.

**Hybrid solutions**

Back in 1991, I was at the first IEEE International Workshop on Artificial Intelligence for Industrial Applications, which dealt with hybrid systems. Only recently have hybrid solutions caught fire; I was expecting them to do so back in the 1990s.

Many of these concepts were prominent research areas in the 1990s but had faded until recently. Many papers within the past five years combine neural networks with Box-Jenkins time-series models or linear regression, for example. The idea behind this is clever: We first make a prediction using the regression method and then use a neural network to predict the non-linear component.

Other types of hybrid solutions relate to hidden Markov models, which have been discussed in previous articles. They are used with neural networks either as post-processing for a network to improve or transform predictions, or to label or select inputs for a neural network.

Hybrids also may include training a neural network and then feeding a set of inputs, together with the network’s output, to a subsequent rule-induction algorithm to produce rules from the network. This helps in many cases because the resulting rules do not have conflicts in the same ways as the raw input dataset.

The problem with hybrid solutions always was computational horsepower: They were simply too expensive to run realistically. With today’s advanced hardware, this restriction is much smaller than it was 20 years ago.

In our next installment, we expand some of these new concepts into real trading applications.