Algorithm bias: A statistical review
The most general (and toughest) challenge faced by technical analysts is neither optimization (optimizing parameters is straightforward) nor overfitting (overfitting avoidance is an assumption), but how best to use domain knowledge to infer an appropriate bias in their algorithms. At the risk of oversimplifying, statistics generally concerns testing a given hypothesis, while machine learning concerns formulating the process of generalization as a search through possible hypotheses in an attempt to find the best hypothesis. Classical statistics involves calculating the probability of the data if the null hypothesis is true, while Bayesian inference involves calculating the probability of a hypothesis, given the data.
As traders in particular, and scientists in general, our aims are better aligned with the paradigms of Bayesian inference and machine learning than with those of classical hypothesis testing.
Consider this: Let B be background information, H a hypothesis and D data. Then P(H|B) is known as the prior, P(D|B) the probability of the data, P(D|HB) the likelihood, and P(H|DB) the posterior. The probabilities are famously related via Bayes' theorem:

P(H|DB) = P(D|HB) P(H|B) / P(D|B)
As there is no such thing as an absolute probability, for notational convenience we often omit B. As the denominator in Bayes' theorem, P(D|B), is independent of H, when comparing hypotheses we can omit it and use:

P(H|D) ∝ P(D|H) P(H)
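The proportional form can be sketched in a few lines of code. The priors and likelihoods below are illustrative numbers only, not figures from this article:

```python
# Sketch: comparing two hypotheses with the proportional form of Bayes' theorem.
# The prior and likelihood values are hypothetical, chosen for illustration.
priors = {"H1": 0.5, "H2": 0.5}
likelihoods = {"H1": 0.02, "H2": 0.08}  # P(D|H) for some observed data D

# Unnormalized posteriors: P(H|D) is proportional to P(D|H) * P(H)
unnorm = {h: likelihoods[h] * priors[h] for h in priors}
z = sum(unnorm.values())  # normalizing constant, playing the role of P(D)
posterior = {h: unnorm[h] / z for h in unnorm}
print(posterior)  # H2 comes out four times more probable than H1
```

Because the normalizing constant z is the same for every hypothesis, dropping it never changes the ranking of hypotheses, which is all that model comparison requires.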
In the 18th century, Hume (1740) pointed out that “even after the observation of the frequent or constant conjunction of objects, we have no reason to draw any inference concerning any object beyond those of which we have had experience.” More recently, and with increasing rigor, Mitchell (1980), Schaffer (1994) and Wolpert (1996) showed that bias-free learning is futile.
The important point is that one can never generalize beyond known data without making at least some assumptions. The no-free-lunch theorem for supervised machine learning (Wolpert 1996) states that in terms of off-training set error, there are no prior distinctions between learning algorithms. In particular, this implies that there is no free lunch for overfitting avoidance. In other words, we should only constrain our algorithm if this reflects our prior beliefs.
A model is a family of functions, or equivalently, a function is a particular parameter choice of a model. Model selection is the task of choosing a model with the correct inductive bias, which in practice means selecting a model of optimal complexity for the given data. A more complex model will always fit the training data better, but may not represent the true underlying model and thus perform poorly on new data. Note that model selection (which is difficult) logically precedes parameter selection (which is well understood).
Below, I present a pedagogical example of Bayesian model selection, a method which in principle solves the overfitting problem and is originally due to the work of Sir Harold Jeffreys 70 years ago (Jeffreys 1939). The aim is to predict the daily British pound/U.S. dollar interbank rate. The data set spans Jan. 1, 1993, to Feb. 3, 2008, and consists of the average ask price for each day. Where the target (defined below) was zero return (weekends), the data were excluded. The training set consisted of 3,402 data points and the out-of-sample set included 1,701 data points.
For reasons of market efficiency, it is safest to take the view that there are no privileged features in financial time series, over and above keeping the inputs potentially relevant, orthogonal and using Tobler’s first law of geography that states that “everything is related to everything else, but near things are more related than distant things” (Tobler 1970).
Let p-n be the exchange rate n days ago. Consider five potential inputs:
x1 = log(p0/p-1),
x2 = log(p-1/p-3),
x3 = log(p-3/p-6),
x4 = log(p-6/p-13),
x5 = log(p-13/p-27)
and the target t = log(p1/p0).
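The inputs and target defined above can be computed mechanically from a price series. The sketch below uses a synthetic random-walk series as a stand-in for the GBP/USD ask prices; only the lag structure follows the definitions in the text:

```python
import numpy as np

# Sketch: building the five lagged log-return inputs x1..x5 and the target t.
# `prices` is a synthetic stand-in for the daily GBP/USD ask series.
rng = np.random.default_rng(0)
prices = 1.6 * np.exp(np.cumsum(rng.normal(0.0, 0.005, 400)))  # hypothetical data

lags = [0, 1, 3, 6, 13, 27]  # day offsets: p0, p-1, p-3, p-6, p-13, p-27

def features_and_target(p, i):
    """Inputs x1..x5 and target t for day i, where p[i] is today's price."""
    x = [np.log(p[i - lags[k - 1]] / p[i - lags[k]]) for k in range(1, 6)]
    t = np.log(p[i + 1] / p[i])
    return np.array(x), t

x, t = features_and_target(prices, 100)
```

Note that successive inputs span roughly doubling, non-overlapping windows (1, 2, 3, 7 and 14 days), which keeps them approximately orthogonal, in line with Tobler's law as invoked above.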
Using these inputs, we now consider five linear models, mn, of increasing complexity, where the inputs xn are collectively denoted by the vector x, the coefficients amn (to be determined) are collectively denoted by the vector a, and y(x, a) is our estimate for t:

m1: y = a10 + a11x1
m2: y = a20 + a21x1 + a22x2
m3: y = a30 + a31x1 + a32x2 + a33x3
m4: y = a40 + a41x1 + a42x2 + a43x3 + a44x4
m5: y = a50 + a51x1 + a52x2 + a53x3 + a54x4 + a55x5
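The five nested models can be fitted by ordinary least squares, since model mn simply uses an intercept plus the first n inputs. A sketch, with synthetic stand-ins for the training matrix and target:

```python
import numpy as np

# Sketch: fitting the five nested linear models m1..m5 by least squares.
# X is an (N, 5) matrix of the inputs x1..x5 and t the (N,) target vector;
# both are synthetic stand-ins here, not the article's actual training data.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 0.005, (3402, 5))
t = rng.normal(0.0, 0.005, 3402)

results = {}
for n in range(1, 6):
    # Model m_n: an intercept plus coefficients on the first n inputs.
    A = np.column_stack([np.ones(len(t)), X[:, :n]])
    coef, rss, *_ = np.linalg.lstsq(A, t, rcond=None)
    results[n] = (coef, float(rss[0]))
```

As the text notes, the training-set RSS can only decrease as parameters are added, which is exactly why raw fit cannot be used to choose between the models.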
Returning to the proportional form of Bayes' theorem, substitute our models, mn, for H, as these are our hypotheses, while D is our training data. To obtain the posterior probability of a hypothesis given the data, we require the prior and the likelihood.
First, consider the prior. Rather than setting a uniform prior across models, let us set a uniform prior across functions; in other words, let the prior be proportional to the volume in parameter space. To estimate this, assume each parameter ranges over the 11 integers between -5 and 5, inclusive, and calculate the volume of parameter space. Model 1 has two parameters, giving it a volume of 11² = 121; model 2 has three parameters, giving an associated volume of 11³ = 1,331; and so on.
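The volume calculation is a one-liner, since model mn has n + 1 parameters (an intercept plus n coefficients):

```python
# Sketch: prior volume of parameter space for each model, assuming every
# parameter ranges over the 11 integers from -5 to 5 inclusive.
grid_points = len(range(-5, 6))  # 11
volumes = {n: grid_points ** (n + 1) for n in range(1, 6)}  # m_n has n + 1 parameters
print(volumes)  # model 1 -> 121, model 2 -> 1331, and so on
```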
The efficient market hypothesis (Fama 1970) implies that we should expect an extremely simple anomaly to be discovered and traded out of the market. For this reason, overly simple functions seem less likely, so we shall penalize our simplest model. The priors are calculated below, where c1 is a constant chosen so that the probabilities sum to 1.
We now require the marginal likelihood: the probability of the data given the model, with a random choice of parameters. The Bayesian information criterion (BIC) is a statistical criterion for model selection; it approximates the marginal likelihood and is easy to calculate. Turning to our training set, we perform linear regression for each of the models described above by regressing the target t on the inputs xn. In each case we obtain both the coefficients amn, yielding the optimally parameterized models (functions) shown below, and the residual sum of squares (RSS).
Where n is the number of data points and k the number of free parameters, BIC = n log(RSS/n) + k log(n), and the (marginal) likelihood is proportional to e^(−BIC/2). Figures are given in “Model likelihoods” (above).
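The BIC-to-likelihood step can be sketched as follows. The RSS values below are hypothetical placeholders (the article's actual figures are in its “Model likelihoods” table), and the likelihoods are computed relative to the best model to avoid overflow, which is harmless because only ratios of likelihoods matter:

```python
import math

# Sketch: turning each model's residual sum of squares into a BIC score and
# a relative (marginal) likelihood. The RSS values here are hypothetical.
n_obs = 3402
rss = {1: 0.0850, 2: 0.0848, 3: 0.0847, 4: 0.0846, 5: 0.0846}

def bic(rss_m, k):
    # BIC = n log(RSS/n) + k log(n), with k free parameters
    return n_obs * math.log(rss_m / n_obs) + k * math.log(n_obs)

scores = {m: bic(rss[m], m + 1) for m in rss}  # model m has m + 1 parameters

# Likelihood is proportional to exp(-BIC/2); subtracting the minimum BIC
# rescales so the best model gets relative likelihood 1.0.
b_min = min(scores.values())
rel_likelihood = {m: math.exp(-0.5 * (scores[m] - b_min)) for m in scores}
```

Multiplying each relative likelihood by the corresponding prior and normalizing then gives the posterior model probabilities, completing the model-selection calculation described in the text.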