From the March 01, 2006 issue of Futures Magazine

Statistical analysis for traders

Fundamental analysis often creates some unique problems for traders. First, fundamental relationships are hard to quantify. Second, not all data are created equal. Third, it’s easy to make mistakes that render your fundamental conclusions invalid.

The solution to all of these problems, for those of us without advanced studies in non-linear signal processing and the like, is to keep things simple. One of the quickest ways to get into trouble making forecasts is to employ tools you don’t understand. Thankfully, there are simple, straightforward analysis methods that you can apply to widely available fundamental data to forecast commodity prices.

Regression analysis is one such tool. Regression analysis objectively estimates past fundamental relationships to determine a standard relationship that can be used going forward. This method has been used for decades by researchers in all fields, including market analysis. It is nothing new, but it’s often ignored because it lacks the flair or promises of flashier techniques.

The most important aspect of regression analysis is not the application of the method itself, but the setup. “A model relationship” (right) covers some of the math behind regression analysis, but it’s more important to understand the theory, assumptions and proper variable selection than the intricate algebraic steps of determining individual variable significance.

SUPPLY AND DEMAND

A product’s price is a function of supply and demand. Supply and demand, in turn, are functions of determinants such as production methods, weather, disposable income, tastes, etc.

This relationship may be written mathematically as:

P = ƒ(S, D), where

P=price

S=supply

D=demand

Regression analysis assigns specific numeric weights to the supply and demand determinants we plug into our actual equation. These specific determinants are called independent variables.

Here’s an example:

P(t) = a + b1 * S(t) + b2 * D(t) + e(t), where

a is a constant

b1 = supply’s weight

b2 = demand’s weight

e(t) = an error term for point t

By inserting the supply and demand values for point t and solving the equation, we get an estimate for price. In this equation price is the dependent variable.

We use regression analysis to find the b1 and b2 weights and the constant figure. Using a computer (most spreadsheet software includes the tools you need to do this), we analyze past values of the dependent variable and the independent variables to find these weights. These weights are also called the regression coefficients.
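As a rough illustration of what the software does with the equation above, here is a minimal sketch in Python that finds a, b1 and b2 by ordinary least squares. It assumes least squares is the fitting method, and the supply, demand and price figures are made up so the result can be checked. The function names are hypothetical, not part of any package.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augment the matrix so row operations apply to both sides at once.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Use the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution from the last row upward.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def fit_regression(supply, demand, price):
    """Return (a, b1, b2) for P = a + b1*S + b2*D by least squares."""
    # Each observation becomes a row [1, S, D]; the 1 carries the constant a.
    rows = [[1.0, s, d] for s, d in zip(supply, demand)]
    # Normal equations: (X'X) * coefficients = X'P
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtp = [sum(r[i] * p for r, p in zip(rows, price)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xtp))


# Made-up illustrative figures, not market data.
supply = [2.0, 2.4, 1.8, 2.9, 2.2, 2.6]
demand = [1.9, 2.1, 2.0, 2.3, 2.4, 2.2]
# Prices are built from known weights (a=5, b1=-2, b2=3) with no error term,
# so the fit should recover those weights.
price = [5.0 - 2.0 * s + 3.0 * d for s, d in zip(supply, demand)]

a, b1, b2 = fit_regression(supply, demand, price)
print(round(a, 4), round(b1, 4), round(b2, 4))
```

With real data the prices will not sit exactly on the fitted relationship, so the recovered weights are estimates rather than exact values.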

Unfortunately, we do not have fundamental reports that perfectly quantify supply and demand determinants. Also, many of the determinants can’t be quantified, such as poor worker morale affecting production or consumer tastes driving demand. Fortunately, these factors typically pale in significance compared to determinants such as carryover stocks or yield estimates, which are figures that are reliably estimated.

However, because we can’t model every determinant of supply and demand, our model will include error. That is, each prediction by our regression equation will vary from the actual values by some amount: the “e” in the regression equation above. This can’t be helped, but it can be minimized. The minimization process is explained in “A model relationship.”
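To make the idea of minimizing error concrete, the sketch below fits a one-variable line and then compares its total squared error with that of slightly different weights. It assumes the usual least-squares criterion (the sum of the squared e values). The stocks and price figures are made up, and the function names are hypothetical.

```python
def least_squares_line(x, y):
    """Closed-form least-squares constant and weight for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_errors(a, b, x, y):
    """Total of e(t)^2, where e(t) = actual value - predicted value."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up figures for illustration only.
stocks = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [9.1, 7.8, 7.2, 5.9, 5.0]

a, b = least_squares_line(stocks, prices)
best = sum_squared_errors(a, b, stocks, prices)

# Nudging either coefficient away from its fitted value raises the total error.
worse_weight = sum_squared_errors(a, b + 0.1, stocks, prices)
worse_constant = sum_squared_errors(a + 0.1, b, stocks, prices)
print(best < worse_weight, best < worse_constant)
```

The error never reaches zero here because the made-up prices do not fall on a straight line, which mirrors the unavoidable "e" in the regression equation.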

But while we accept this error, we still need to keep it in check. For regression analysis to be valid, we must maintain a few assumptions regarding the error in our model and the variables we use to build that model.

ASSUMPTIONS

The assumptions of regression analysis must be met for our data if we are to trust our regression equation. Our model’s predictions will be worthless unless the data have certain important characteristics that are required for the math behind the regression model to work.

Linear relationships: Standard regression analysis assumes the dependent variable has a linear or “straight-line” relationship to the independent variable. That is, as the independent variable changes, the dependent variable changes at a constant rate. For example, if a one-unit change in the independent variable corresponds with a 10-unit change in the dependent variable, a two-unit change in the independent variable must correspond with a 20-unit change in the dependent variable.

Stationarity: The statistical properties of data, such as averages and correlations, must be stationary. That is, they must not change from one observation to another.

Normal statistical distribution: The mathematics of regression analysis assume “normal” distributions of values, most critically for the model’s errors.

Uncorrelated independent variables: If more than one independent variable is modeled against the dependent variable, the independent variables can’t be correlated with each other. If they are, the significance of each variable to the dependent variable will not be clear, and the model may face a greater chance of breakdown.
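One simple way to screen for the last assumption is to compute the correlation between every pair of candidate independent variables. The sketch below does this with the Pearson correlation coefficient. The figures are made up, "planted_area" is a hypothetical variable deliberately built from the carryover figures so that it overlaps with them, and the 0.8 cutoff is an arbitrary assumption rather than a standard threshold.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up figures for illustration only.
variables = {
    "carryover_stocks": [2.1, 1.8, 2.5, 3.0, 2.2, 1.6],
    "yield_estimate": [38.0, 41.5, 36.0, 39.5, 42.0, 37.5],
    # Built as 2 * carryover + 0.1, so it carries no new information.
    "planted_area": [4.3, 3.7, 5.1, 6.1, 4.5, 3.3],
}

CUTOFF = 0.8  # arbitrary screening level, an assumption of this sketch

flagged = []
for name_a, name_b in combinations(variables, 2):
    r = pearson(variables[name_a], variables[name_b])
    if abs(r) > CUTOFF:
        flagged.append((name_a, name_b, round(r, 3)))

print(flagged)
```

A flagged pair suggests that one of the two variables could be dropped, or the two combined, before fitting the model.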

A number of tests exist that will help determine if the above assumptions are satisfied well enough for our regression equation to be valid. Some tests simply provide a “yes” or “no” answer. Other tests return relative values that only make sense when compared to the same values for other equations. Basically, we are looking for random, normal distributions of our error figures. The error criteria are summarized by the right side chart in “Required assumptions” (below). In the second part of this series we will use some of these tests to analyze our sample regression equation.
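As one example of a check on the error figures, the sketch below computes the Durbin-Watson statistic, a common measure of whether each error is related to the one before it. This is an illustrative choice, not necessarily one of the tests used later in this series. Values near 2 suggest no such pattern, values toward 0 suggest errors that trend, and values toward 4 suggest errors that flip sign from one observation to the next. The two error series are contrived to show the extremes.

```python
def durbin_watson(residuals):
    """Sum of squared changes in the errors divided by the sum of squared errors."""
    changes = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    total = sum(e ** 2 for e in residuals)
    return changes / total


def mean(values):
    """Simple average; the errors of a fitted model should average near zero."""
    return sum(values) / len(values)


# Contrived error series, not output from a real model.
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]      # errors persist, then switch
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # errors flip every observation

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(round(dw_trending, 3), round(dw_alternating, 3), mean(trending))
```

Neither series looks random, and the statistic reflects that by landing well away from 2 in both cases.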

WHAT TO MEASURE

One of the most important steps in building a regression model is deciding what, exactly, you are going to model; that is, the dependent variable. A lot of factors will go into determining your dependent variable. You will want something that should be logically affected by the independent variables that are available. You will want something that is easily identifiable in the past and will be easily identifiable in the future. You also don’t want to get too fancy: there is no sense in introducing more complexity into your model than necessary.

Of course, the dependent variable also should be something that can benefit your actual trading program. Don’t expect to be able to predict some magical figure that can serve as the beginning and end of your trading strategy, say, the closing price of a specific contract month on its last trading day.

Don’t try to be a sniper when it comes to forecasting the future. Instead, aim broadly. Look for something that provides a general representation of market prices. One general representation of a market’s fundamental balance for a certain period is the average market price through that period. But if we accept that as our measure, we still need to determine the period and the market. The autumn harvest is one of the more important periods for several markets, and the new-crop contracts, such as the November soybean contract and the December corn contract, clearly should reflect the price determined by the supply and demand factors in effect during this period.

For our purposes, then, the average price of November soybeans fits these considerations nicely. The price of November soybeans is highly affected by new crop supply and demand determinants. This also is a liquid market with substantial commercial participation, and the fundamentals are well-reported by government agencies and wire services. To narrow our window of analysis even further, we’ll calculate our average with prices from mid-August through expiration. Mid-August falls right between plantings and harvest. So, we should be able to collect reliable information on the current season before the window effectively shuts.
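The dependent variable described here, an average price over a fixed window, is simple to compute once daily closes are in hand. The sketch below is a minimal illustration. The closing prices are made up, and both the Aug. 15 start date and the expiration date are placeholders rather than actual contract dates.

```python
from datetime import date


def window_average(closes, start, end):
    """Average the closing prices dated from start through end, inclusive."""
    in_window = [price for day, price in closes.items() if start <= day <= end]
    if not in_window:
        raise ValueError("no prices fall inside the window")
    return sum(in_window) / len(in_window)


# Made-up closes in cents per bushel, keyed by date; not actual market data.
closes = {
    date(2005, 8, 1): 600.0,    # before the window, so it is ignored
    date(2005, 8, 15): 610.0,
    date(2005, 9, 15): 590.0,
    date(2005, 10, 14): 620.0,
    date(2005, 11, 14): 580.0,
    date(2005, 11, 30): 570.0,  # after the placeholder expiration, ignored
}

window_start = date(2005, 8, 15)   # "mid-August" taken as Aug. 15 (assumption)
expiration = date(2005, 11, 14)    # placeholder expiration date (assumption)

average_price = window_average(closes, window_start, expiration)
print(average_price)
```

Repeating this for each season in the study period gives one observation of the dependent variable per year.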

As for the period to study, we should include as many observations, or cases, as possible to increase the statistical significance of our analysis. But, of course, the significance of fundamental factors changes through time. We want to study the largest number of cases possible without including seasons that exhibit a price/fundamental relationship that is no longer valid. We also need to keep in mind the availability of past independent variables. November soybean data are available for half a century, but figures for past independent variables that are comparable to the same type of data available today are not.

One logical starting point is 1973. This was the first November contract traded after the dollar/gold link was severed. A reasonable argument is that this de-linking represented a systemic change in the commodity markets that changed the degree to which the independent variables affect the dependent variable. However, in the few years after the de-linking, volatility in all commodities surged, which even a cursory glance at a historical price chart will confirm. As the volatility in those few subsequent years likely was caused by a factor that won’t be repeated, we’ll move our starting point up a bit and begin our analysis with the 1976 November soybean contract.

Next month, we’ll look at several possible dependent variables and discuss a few best practices that we’ll follow so our model will best represent the reality of how we’ll use it.
