The most general (and toughest) challenge faced by technical analysts is neither optimization (optimizing parameters is straightforward) nor overfitting (overfitting avoidance is an assumption), but how best to use domain knowledge to infer an appropriate bias in their algorithms. At the risk of oversimplifying, statistics generally concerns testing a given hypothesis, while machine learning concerns formulating the process of generalization as a search through possible hypotheses in an attempt to find the best hypothesis. Classical statistics involves calculating the probability of the data if the null hypothesis is true, while Bayesian inference involves calculating the probability of a hypothesis, given the data.
As traders in particular, and scientists in general, our aims are better aligned with the paradigms of Bayesian inference and machine learning than with classical statistics.
Consider this: Let B be background information, H a hypothesis and D data. Then P(H|B) is known as the prior, P(D|B) the probability of the data, P(D|HB) the likelihood, and P(H|DB) the posterior. The probabilities are famously related via Bayes’ theorem:
P(H|DB) = P(D|HB) P(H|B) / P(D|B)
As there is no such thing as an absolute probability (every probability is conditional on some background information), for notational convenience we often omit B. As the denominator in Bayes’ theorem, P(D|B), is independent of H, when comparing hypotheses we can omit it and use:
P(H|D) ∝ P(D|H) P(H)
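As a minimal sketch of comparing hypotheses in this way, the snippet below multiplies priors by likelihoods and normalizes over a finite set of hypotheses. The function name `posterior` and the numbers are hypothetical and chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Normalize prior x likelihood over a finite set of hypotheses."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalized)  # plays the role of P(D|B)
    return [u / total for u in unnormalized]

# Two hypothetical hypotheses with equal priors; the data are three
# times as likely under the first as under the second.
post = posterior([0.5, 0.5], [0.03, 0.01])
print(post)
```

Because the denominator is the same for every hypothesis, the ranking of hypotheses is already fixed by the unnormalized products; the normalization only makes the results sum to 1.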
In the 18th century, Hume (1740) pointed out that “even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience.” More recently, and with increasing rigor, Mitchell (1980), Schaffer (1994) and Wolpert (1996) showed that bias-free learning is futile.
The important point is that one can never generalize beyond known data without making at least some assumptions. The no-free-lunch theorem for supervised machine learning (Wolpert 1996) states that in terms of off-training set error, there are no prior distinctions between learning algorithms. In particular, this implies that there is no free lunch for overfitting avoidance. In other words, we should only constrain our algorithm if this reflects our prior beliefs.
A model is a family of functions, or equivalently, a function is a particular parameter choice of a model. Model selection is the task of choosing a model with the correct inductive bias, which in practice means selecting a model of optimal complexity for the given data. A more complex model will always fit the training data better, but may not represent the true underlying model and thus perform poorly on new data. Note that model selection (which is difficult) logically precedes parameter selection (which is well understood).
Below, I present a pedagogical example of Bayesian model selection, a method which in principle solves the overfitting problem and is originally due to the work of Sir Harold Jeffreys 70 years ago (Jeffreys 1939). The aim is to predict the daily British pound/U.S. dollar interbank rate. The data set spans Jan. 1, 1993, to Feb. 3, 2008, and consists of the average ask price for each day. Days on which the target (defined below) was a zero return (weekends) were excluded. The training set consisted of 3,402 data points and the out-of-sample set included 1,701 data points.
For reasons of market efficiency, it is safest to take the view that there are no privileged features in financial time series. Beyond that, the inputs should be potentially relevant and orthogonal, and should respect Tobler’s first law of geography, which states that “everything is related to everything else, but near things are more related than distant things” (Tobler 1970). Applied to a time series, this suggests resolving recent price changes more finely than distant ones.
Let p_{-n} be the exchange rate n days ago. Consider five potential inputs:
x_1 = log(p_0/p_{-1}),
x_2 = log(p_{-1}/p_{-3}),
x_3 = log(p_{-3}/p_{-6}),
x_4 = log(p_{-6}/p_{-13}),
x_5 = log(p_{-13}/p_{-27})
and the target t = log(p_1/p_0).
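The inputs and target above can be computed from a price series along the following lines. This is a sketch under stated assumptions: the function name `make_features` is hypothetical, the prices are assumed to be a clean one-dimensional daily series with excluded days already removed, and the demonstration series is synthetic rather than the pound/dollar data.

```python
import numpy as np

def make_features(prices):
    """Build inputs x_1..x_5 and target t from a 1-D array of daily prices.

    Each row corresponds to one "today". The first usable day is index 27,
    so that p_{-27} exists; the last price serves only as p_1.
    """
    logp = np.log(np.asarray(prices, dtype=float))
    lags = [0, 1, 3, 6, 13, 27]            # window boundaries from the text
    today = np.arange(27, len(logp) - 1)   # days with full history and a next day
    # x_j = log(p_{-a} / p_{-b}) is a difference of log prices
    X = np.column_stack([logp[today - a] - logp[today - b]
                         for a, b in zip(lags[:-1], lags[1:])])
    t = logp[today + 1] - logp[today]      # next-day log return
    return X, t

# Synthetic prices growing by 1% per day in log terms (illustration only).
prices = np.exp(0.01 * np.arange(40))
X, t = make_features(prices)
print(X.shape, t.shape)
```

The five windows do not overlap, and each is wider than the one before it, so recent price changes are resolved more finely than distant ones.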
Using these inputs, we now consider five linear models, m_n, of increasing complexity. The inputs x_n are collectively denoted by the vector x, the coefficients a_mn, which are to be determined, are collectively denoted by the vector a, and y(x, a) is our estimate for t.
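One reading of these models that is consistent with the parameter counts quoted later for the prior (two parameters for model 1, three for model 2) is that model m uses an intercept and the first m inputs. That nesting, and the index 0 for the intercept, are assumptions, written out here in LaTeX:

```latex
% Assumed nested form: model m has an intercept a_{m0} and the first m inputs.
\begin{align*}
m_1 &: \; y(x, a) = a_{10} + a_{11} x_1 \\
m_2 &: \; y(x, a) = a_{20} + a_{21} x_1 + a_{22} x_2 \\
m_3 &: \; y(x, a) = a_{30} + a_{31} x_1 + a_{32} x_2 + a_{33} x_3 \\
m_4 &: \; y(x, a) = a_{40} + a_{41} x_1 + a_{42} x_2 + a_{43} x_3 + a_{44} x_4 \\
m_5 &: \; y(x, a) = a_{50} + a_{51} x_1 + a_{52} x_2 + a_{53} x_3 + a_{54} x_4 + a_{55} x_5
\end{align*}
```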
Returning to the second form of Bayes’ theorem, in which the denominator is omitted, substitute our models, m_n, for H, as these are our hypotheses, while D is our training data. To obtain the posterior probability of a hypothesis, given the data, we require the prior and the likelihood.
First, consider the prior. Rather than setting a uniform prior across models, let us select a uniform prior across functions. In other words, let the prior be proportional to the volume in parameter space. To estimate this, suppose each parameter may take any integer value from -5 to 5, inclusive (11 values), and calculate the volume of parameter space. Model 1 has two parameters, giving it a volume of 11^2, model 2 has three parameters, giving an associated volume of 11^3, and so on.
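The volume calculation can be sketched as follows. The parameter counts for models 3 to 5 follow the "and so on" pattern and are assumed; the result is the volume-based prior only, before any further adjustment.

```python
# Each parameter is assumed to take an integer value from -5 to 5 inclusive.
values_per_parameter = len(range(-5, 6))   # 11 values
n_params = [2, 3, 4, 5, 6]                 # models 1..5 (counts for 3..5 assumed)

volumes = [values_per_parameter ** k for k in n_params]
total = sum(volumes)
priors = [v / total for v in volumes]      # proportional to volume, sums to 1

for m, (v, p) in enumerate(zip(volumes, priors), start=1):
    print(f"model {m}: volume = {v}, prior = {p:.6f}")
```

Under this scheme each additional parameter multiplies the prior weight by 11, so a uniform prior over functions strongly favors the more complex models.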
The efficient market hypothesis (Fama 1970) implies that we should expect an extremely simple anomaly to be discovered and traded out of the market. For this reason, overly simple functions seem less likely, so we shall penalize our simplest model. The priors are calculated below. c1 is a constant, so that the probabilities sum to 1.
We now require the marginal likelihood, which is the probability of the data, given the model with a random choice of parameters. The Bayesian information criterion (BIC) is a statistical criterion for model selection. It approximates the marginal likelihood, and is easy to calculate. We now turn to our training set and perform linear regression on each model described in the model equations by regressing the target t on the inputs x_n, and in each case obtain both the coefficients, a_mn, resulting in the optimally parameterized models (functions) shown below, and the residual sum of squares (RSS).
With n the number of data points and k the number of free parameters, BIC = n log(RSS/n) + k log(n), and the (marginal) likelihood is proportional to e^(-0.5 BIC). Figures are given in “Model likelihoods.”
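A minimal sketch of the regression and BIC step, assuming ordinary least squares with an intercept, natural logarithms, and k counted as the number of regression coefficients. The function name `bic_linear`, the synthetic data and the ordering of the nested models are hypothetical and are not the article's data or results.

```python
import numpy as np

def bic_linear(X, t):
    """Fit t ~ intercept + X by least squares; return (coefficients, RSS, BIC)."""
    n = len(t)
    A = np.column_stack([np.ones(n), X])          # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    rss = float(np.sum((t - A @ coef) ** 2))      # residual sum of squares
    k = A.shape[1]                                # free parameters, intercept included
    bic = n * np.log(rss / n) + k * np.log(n)
    return coef, rss, bic

# Synthetic data for illustration only: the target depends on the first input.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
t = 0.5 + 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Nested models: model m uses the first m columns (an assumed ordering).
results = [bic_linear(X[:, :m], t) for m in range(1, 6)]
for m, (coef, rss, bic) in enumerate(results, start=1):
    print(f"model {m}: k = {len(coef)}, RSS = {rss:.4f}, BIC = {bic:.1f}")
```

The RSS can only fall or stay the same as inputs are added, which is why the k log(n) term is needed to penalize complexity.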
Finally, combining the priors determined in the second set of equations and the likelihoods in “Model likelihoods,” we can calculate the posterior probability of each model, given the data. Recall that P(model|data) is proportional to prior × likelihood. The calculations are given in the equations below, where c2 is a constant to ensure that the posterior probabilities sum to one.
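Combining priors and BIC-based likelihoods can be sketched as below. Because e^(-0.5 BIC) overflows or underflows easily for realistic sample sizes, this version works with logarithms and normalizes at the end. The function name `posterior_from_bic` and the BIC values are hypothetical.

```python
import numpy as np

def posterior_from_bic(priors, bics):
    """Posterior proportional to prior * exp(-0.5 * BIC), normalized to sum to 1."""
    log_post = np.log(np.asarray(priors, dtype=float)) - 0.5 * np.asarray(bics, dtype=float)
    log_post -= log_post.max()       # shift so the largest term becomes exp(0) = 1
    post = np.exp(log_post)
    return post / post.sum()         # division by the sum plays the role of c2

# Hypothetical values: equal priors, second model's BIC lower by 2.
post = posterior_from_bic([0.5, 0.5], [-40000.0, -40002.0])
print(post)
```

Only differences in BIC matter for the posterior, so subtracting the maximum log term changes nothing except numerical stability.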
Given the data, and our assumptions (in the prior), the equations above show us that the second and third models are the most probable, with the third model having the highest posterior probability. Applying the third model, parameterized as in the regression equations, to our out-of-sample data generates a return of 14.24% per annum before costs, as shown in “Model profits.”
We chose the most probable model, but we can do better than that. It is optimal to take an average over all models, with each model’s prediction weighted by its posterior probability. This is known as Bayesian model averaging, and increases our return to 15.65% per annum, before costs, as shown in “Model profits.”
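The averaging step itself is a weighted sum of the models' predictions. In this sketch the function name `bma_predict`, the predictions and the posterior weights are hypothetical; how the averaged prediction is turned into a trade is not covered.

```python
import numpy as np

def bma_predict(predictions, posteriors):
    """Average model predictions, weighting each model by its posterior.

    predictions: array of shape (n_models, n_samples)
    posteriors:  array of shape (n_models,), assumed to sum to 1
    """
    predictions = np.asarray(predictions, dtype=float)
    posteriors = np.asarray(posteriors, dtype=float)
    return posteriors @ predictions   # weighted sum over the model axis

# Two hypothetical models predicting next-day log returns for two days.
preds = [[0.001, -0.002],
         [0.003,  0.000]]
avg = bma_predict(preds, [0.25, 0.75])
print(avg)
```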
OUT OF SAMPLE
The trading community typically worries about avoiding overfitting and statistical significance, but our practical successes have been due to the appropriate application of bias. Everyone should be a Bayesian, and use domain knowledge to make intelligent and explicit assumptions and adhere to the rules of probability, because how aligned your learning algorithm is with the domain determines how well you will generalize.
What does this mean in practice? A market price is generated by a non-stationary, stochastic, discontinuous and probably non-linear dynamic process, and any useful (that is, profitable) signal is extremely noisy. The resulting time series approximates a martingale, which makes prediction extremely difficult.
A profitable trading system is therefore surprising, and as such requires ample evidence. Moreover, as outlined above, the efficient market hypothesis implies that a simple signal should not persist, while the low signal-to-noise ratio dictates that a complex signal should not arise. We should therefore seek models of intermediate complexity.
Martin Sewell is a senior research associate at the University of Cambridge.