From the April 01, 2006 issue of Futures Magazine

Best practices for statistical trading

Valid statistical analysis requires considerable care at every stage of the process. You can’t take shortcuts and you can’t make assumptions. Thankfully, you can reasonably achieve this.

Our goal is to develop a standard model to forecast the average fall price of new-crop soybeans. To do so, we will use standard multiple regression analysis to assign weights, or regression coefficients, to the values of independent variables that we suspect will correlate with the average fall price of soybeans.

For the average fall price, which we refer to as the dependent variable, we are using the average price from mid-August through expiration of the November soybean contract. Using this time frame allows us to glean our fundamental data from the mid-August World Agricultural Supply and Demand Estimates (Wasde) reports.

And why we have to do that brings us to the first best practice that we’ll adhere to in our analysis.

A FEW GOOD RULES

Some statistical models are designed to explain, but we’re interested in models that forecast, so when we analyze the past we want to compare the average price of November soybeans from mid-August on to data that were available before mid-August. In other words, we will model price vs. expectations of what the fundamentals will be, not what the fundamentals ultimately were.

Past forecasts of fundamental data are not as readily available as the final revised numbers, but they are out there.

Next, and even more important than modeling expectations, is the need to hold back part of our data as “out-of-sample,” to avoid curve fitting.

In our case, we are beginning our analysis with the 1976 crop year. We will end the in-sample data set with 2000. Our out-of-sample validation set will be 2001 through 2005.
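The split described above can be sketched in a few lines of Python. This is only an illustration: the helper name `split_by_year` is our own, and the article does not prescribe any particular tool for the job.

```python
# Crop years covered by the analysis, per the article: 1976 through 2005.
FIRST_YEAR, LAST_IN_SAMPLE, LAST_YEAR = 1976, 2000, 2005

def split_by_year(years, last_in_sample=LAST_IN_SAMPLE):
    """Return (in_sample, out_of_sample) lists of crop years."""
    in_sample = [y for y in years if y <= last_in_sample]
    out_of_sample = [y for y in years if y > last_in_sample]
    return in_sample, out_of_sample

years = list(range(FIRST_YEAR, LAST_YEAR + 1))
in_sample, out_of_sample = split_by_year(years)
print(len(in_sample), len(out_of_sample))  # 25 5
```

The model is fitted only on the 25 in-sample years; the five held-back years are never shown to the regression.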

Third, we will account for inflation by adjusting past prices according to the producer price index (PPI) before we examine the effect, if any, of the selected independent variables. This also means that as we apply our model going forward, the results will have to be adjusted by the most recent value of the PPI. The PPI is a gauge of inflation calculated by the Bureau of Labor Statistics.
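One plausible way to make that adjustment is a simple ratio to a chosen base period, sketched below. The article does not spell out its exact method, so the ratio approach, the function names and the index values here are all assumptions for illustration; they are not actual PPI readings or soybean prices.

```python
def adjust_for_inflation(nominal_price, ppi_then, ppi_base):
    """Express a past price in base-period dollars using the PPI ratio."""
    return nominal_price * ppi_base / ppi_then

def to_nominal(real_price, ppi_now, ppi_base):
    """Convert a model output in base-period dollars back to current dollars."""
    return real_price * ppi_now / ppi_base

# Illustrative numbers only; NOT actual PPI readings or prices.
ppi_then, ppi_base = 100.0, 160.0
real = adjust_for_inflation(6.00, ppi_then, ppi_base)
print(round(real, 2))  # 9.6
```

The second function covers the point made above about applying the model going forward: a forecast produced in base-period dollars has to be rescaled by the most recent PPI value before it can be compared with a quoted price.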

Fourth, we will look for independent variables that have a linear relationship with the dependent variable. We want the fundamental relationship to be stable through time. If a 10% change in yield per acre affected prices by 50¢ in 1980, we want to see the same relationship in 1993. We do not want to see a relationship that changes in its significance. The reason is simple. Without manipulating the variables themselves, standard multiple regression analysis does not result in valid models if the relationships are not linear.

Fifth, our model must not exhibit the three problems that often plague multiple regression analysis: multicollinearity, heteroscedasticity and autocorrelation. We’ll explain these terms later.

WHAT MOVES BEANS?

The fundamental drivers of the soybean market don’t have to be complicated. We will look for our independent variables in past Wasde reports. This monthly report provides the most current U.S. Department of Agriculture forecasts of U.S. and world supply-use balances of major grains, soybeans and cotton, as well as the U.S. supply and use of sugar and livestock. You can find the actual numbers from past Wasde reports (not final revised figures) at: http://jan.mannlib.cornell.edu/data-sets/crops/95501 (prior to 1995) and http://jan.mannlib.cornell.edu/reports/waobr/wasde-bb (after 1995).

Current Wasde reports can be downloaded off the USDA’s Web site.

The variables we’re interested in are annual forecasted soybean production, the forecasted soybean usage/ending stocks ratio, forecasted soybean crushings, forecasted soybean yield and the forecasted corn usage/ending stocks ratio. We’ll look at corn data because farmers often have to choose between planting corn or soybeans. Although a good growing year tends to help both crops alike, the corn data still add information to our model.

Most data that you’ll examine for relationships with your dependent variable will be constant data streams that ebb and flow through time. But there’s another type of independent variable that can be useful: a dummy variable, which simply assigns a “1” to a period when it’s valid and a “0” to a period when it’s not valid. Dummy variables are useful when a market shock (say, a war or a government-mandated price freeze) might have affected the market.

One shock that has affected all U.S. crops in the past is El Nino. El Nino is a naturally occurring cycle of a general warming and cooling in the Tropical Pacific that can have significant implications for weather in many other places of the world. Some of the effect of El Nino is psychological and not reflected in fundamental forecasts.

By examining historical sea-surface temperatures (SST) of the South Pacific we can determine which years to assign a “1” for an El Nino year (data are available at www.cpc.noaa.gov/data/indices/sstoi.indices). The years 1983, 1988 and 1998 fit the bill with relatively high SST readings.
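In code, the dummy variable is nothing more than a lookup against those three years. A minimal sketch in Python (the function name is ours):

```python
# El Nino years identified in the article from sea-surface temperature readings.
EL_NINO_YEARS = {1983, 1988, 1998}

def el_nino_dummy(year):
    """Return 1 for an El Nino year, 0 otherwise."""
    return 1 if year in EL_NINO_YEARS else 0

# One dummy value per in-sample crop year (1976 through 2000).
dummies = {year: el_nino_dummy(year) for year in range(1976, 2001)}
print(sum(dummies.values()))  # 3
```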

VARIABLE SELECTION

Sophisticated model-selection software often doesn’t give you the chance to get a feel for each independent variable’s actual effect on the dependent variable. We’ll use two graphical tools to select which variables we’ll use for our sample model: scatter plots and a correlation matrix.

“Fundamental relationships” (below) shows us what average fall prices of November soybeans correspond to different levels in forecasts of our independent variables. Clearly, all five of the variables have some relationship with price.

For example, the forecasted soybean usage/ending stocks ratio clearly has a positive relationship. This is intuitively pleasing. The usage/ending stocks ratio is higher when the amount of soybeans forecasted to be used is higher than the amount of soybeans forecasted to be left over. Therefore, we would expect higher levels of the usage/ending stocks ratio to correspond to higher soybean prices.

A negative relationship is indicated by soybean production. Again, this is intuitive. More soybeans on the market suggest lower prices.

Now, shifting our attention to the correlation matrix, which you can find at www.futuresmag.com/additional_copy/statistical.htm, we can determine which variables are unlikely to create the problem of multicollinearity. Most of the soybean fundamental data are correlated. When an acre yields more soybean bushels, more soybeans are obviously produced. When building a model with standard multiple regression analysis, we cannot use both of these variables to forecast the dependent variable.

However, as we can see in the correlation matrix, both forecasted soybean crushings and the forecasted corn usage/ending stocks ratio have some correlation to price, while neither seems to have much correlation to the other.

Along with the El Nino dummy variable, those are the independent variables that we’ll use in our model.
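A correlation matrix of this kind is easy to build by hand. The sketch below computes pairwise Pearson correlations in plain Python; the column names and numbers are toy values chosen for illustration, not Wasde figures, and the helper names are our own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Map every pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy data for illustration only, not Wasde figures.
data = {
    "price": [5.0, 6.0, 7.0, 8.0],
    "crush": [4.0, 3.0, 2.0, 1.0],
    "corn_ratio": [1.0, 3.0, 2.0, 4.0],
}
matrix = correlation_matrix(data)
print(round(matrix["price"]["crush"], 2))  # -1.0
```

When screening candidates, look for variables whose correlation with price is meaningfully different from zero but whose correlation with each other is close to zero.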

MAKING A MODEL

We covered the basic mathematics of regression analysis in the first part of this series. However, that was for a model that used just one variable. This model uses three. While the goal is the same (to find the weights, or regression coefficients, that describe the effect of the independent variables on the dependent variable), the process for achieving that goal is far more complex. The only reasonable solution is to use technology to find the weights for us.

While there are many commercial software programs designed for standard multiple regression analysis, we suggest you use the regression analysis tools available in the spreadsheet software you likely already have.
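For readers curious about what such a tool does under the hood, here is a rough sketch of ordinary least squares solved through the normal equations in plain Python. It is not the article's method (we used a spreadsheet), the function name `fit_ols` is hypothetical, and the data are toy values constructed so that the true weights are known.

```python
def fit_ols(rows, y):
    """Least-squares weights [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of ones for the intercept
    k = len(X[0])
    # Build the augmented system (X'X | X'y).
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Toy data generated from y = 2 + 3*x1 - 1*x2, so the fit should recover those weights.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3, 7, 6, 11, 9]
weights = fit_ols(rows, y)
print([round(w, 6) for w in weights])  # approximately [2.0, 3.0, -1.0]
```

A spreadsheet or statistics package will also report the t-statistics, F statistic and R^2 discussed later, which this sketch leaves out.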

We used Microsoft Excel to calculate our model and its performance statistics (see www.futuresmag.com/additional_copy/statistical.htm). The output is shown in “Output example” (below), while the manually created chart in “Past estimations” (below) shows the estimates for past prices along with the actual past (inflation-adjusted) prices.

According to Excel, this is our model:

SX = 13.082 + X1 * -0.00720 + X2 * 0.22446 + X3 * 1.50606543

Where: SX is the average fall price of November soybeans.

X1 is the forecasted soybean crushings from the mid-August Wasde report.

X2 is the forecasted corn usage/ending stocks ratio from the mid-August Wasde report.

X3 is an El Nino dummy variable.
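Written as a function, the model looks like the sketch below. The coefficients are those reported above; the input values in the example call are hypothetical, and we are assuming the inputs are expressed in the same units as the Wasde figures used to fit the model.

```python
# Coefficients as reported in the article's Excel output.
INTERCEPT = 13.082
B_CRUSH = -0.00720        # weight on forecasted soybean crushings (X1)
B_CORN_RATIO = 0.22446    # weight on forecasted corn usage/ending stocks ratio (X2)
B_EL_NINO = 1.50606543    # weight on the El Nino dummy (X3)

def forecast_price(crush, corn_ratio, el_nino):
    """Inflation-adjusted average fall price of November soybeans (SX)."""
    return (INTERCEPT + B_CRUSH * crush
            + B_CORN_RATIO * corn_ratio + B_EL_NINO * el_nino)

# Hypothetical inputs for illustration only, not actual Wasde forecasts.
print(round(forecast_price(1000, 5, 0), 4))  # 7.0043
```

Note that the output is in inflation-adjusted dollars, so it still has to be rescaled by the current PPI before it can be compared with a quoted price.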

IS IT VALID?

To determine whether a standard multiple regression model is valid, we take a basic three-step approach. First, we look at numbers that indicate the model closely estimates the past data. Next, we look for problems of multicollinearity, heteroscedasticity and autocorrelation. Finally, we test the model on the out-of-sample data.

There are a few guidelines to follow to tell whether the model meets basic standards.

The F statistic tells us the significance of our model as a whole. Larger F statistics indicate greater significance. While there is a critical value of the F statistic for each model, the significance level, which most regression analysis software provides, is a quicker reference. Simply, this number must be less than one minus the confidence level you seek; for 95% confidence, it must be below 0.05. Our model is significant at the 99.99886% confidence level (a significance level of about 0.0000114), so it certainly passes.
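Read as a rule, the check compares the significance level reported by the software against one minus the confidence level you want. A small sketch, with a function name of our own choosing:

```python
def is_significant(significance_f, confidence=0.95):
    """True when the reported significance level clears the chosen confidence level."""
    return significance_f < 1.0 - confidence

# The article's model is significant at 99.99886% confidence,
# which corresponds to a significance level of roughly 0.0000114.
model_confidence = 0.9999886
significance_f = 1.0 - model_confidence

print(is_significant(significance_f))  # True
print(is_significant(0.10))            # False
```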

Next, we look at the significance of each independent variable. This is telegraphed by the t-statistics. T-statistics tell you whether a variable adds anything to the model. To be significant, a variable must have a certain absolute t-statistic value. You can find this value in any statistics book. With 25 observations and three independent variables (21 degrees of freedom), our model’s variables must have t-statistics whose absolute values are greater than 1.72, the one-tailed 5% critical value, to be significant. Looking at the model, we see this is the case, with t-statistics of –6.59, 2.95 and 1.99.
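The same check can be written out explicitly. The t-statistics and the 1.72 threshold below are the ones quoted above; pairing each t-statistic with a variable name assumes they are listed in the same order as the model's variables.

```python
# t-statistics quoted in the article, assumed to be in model order (X1, X2, X3).
T_STATS = {
    "soybean crushings": -6.59,
    "corn usage/ending stocks": 2.95,
    "El Nino dummy": 1.99,
}
CRITICAL_T = 1.72  # threshold quoted in the article

n_obs, n_vars = 25, 3
degrees_of_freedom = n_obs - n_vars - 1  # one more is lost to the intercept

significant = {name: abs(t) > CRITICAL_T for name, t in T_STATS.items()}
print(degrees_of_freedom, all(significant.values()))  # 21 True
```

The El Nino dummy clears the bar by the narrowest margin, so it is the variable most worth watching when the model is re-estimated.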

Another gauge is the model’s standard error. The smaller, the better. This figure always will be positive. For our model, the standard error is 1.20, against inflation-adjusted values of the dependent variable ranging from $3.47 to $10.90.

Finally, we have the multiple R^2 statistic. This figure tells us what proportion of the variance in the dependent variable is accounted for by all the independent variables combined. For example, our model’s multiple R^2 tells us that 83.5% of the variance in the average fall price of November soybeans is explained by the selected independent variables.
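Both the standard error and R^2 come straight from the model's errors. The sketch below uses the standard textbook definitions; the numbers are toy values, not our soybean data, and the function names are our own.

```python
from math import sqrt

def r_squared(actual, predicted):
    """Share of the variance in `actual` accounted for by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def standard_error(actual, predicted, n_vars):
    """Standard error of the regression with n_vars independent variables."""
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(ss_res / (len(actual) - n_vars - 1))

# Toy values for illustration only.
actual = [1, 2, 3, 4, 5]
predicted = [1.5, 2, 3, 4, 4.5]
r2 = r_squared(actual, predicted)
se = standard_error(actual, predicted, n_vars=1)
print(round(r2, 2), round(se, 3))  # 0.95 0.408
```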

There are three typical problems in multiple regression analysis: heteroscedasticity, where errors exhibit different spreads for different values of the projected dependent variable; autocorrelation, where errors are correlated through time; and multicollinearity, where independent variables are correlated. If heteroscedasticity or autocorrelation is present, our independent variables may have been estimated inconsistently, we may be lacking critical variables, or relationships may not be linear. If multicollinearity is present, our model’s regression statistics are meaningless.

The easiest way to test for autocorrelation or heteroscedasticity is to examine a model’s residual plots. Residuals are our model’s errors. For example, if we predicted an average price of November soybeans of $5.54 per bushel for the fall of 1984, but the actual average was $5.99, our residual for 1984 was 45¢.

If our errors are random across consecutive predicted values of our dependent variable, then our errors are homoscedastic: they have a constant spread. Likewise, if the residuals appear random through time, then we likely don’t have the problem of autocorrelation. A more sophisticated test for autocorrelation is the Durbin-Watson test, a formula that objectively measures the amount of autocorrelation present in your model.
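The Durbin-Watson statistic is simple enough to compute by hand: the sum of squared changes between consecutive residuals, divided by the sum of squared residuals. Values near 2 suggest little autocorrelation, values near 0 suggest positive autocorrelation and values near 4 suggest negative autocorrelation. A sketch using the standard formula, with toy residuals apart from the 1984 example:

```python
def residuals(actual, predicted):
    """Model errors: actual value minus predicted value, in time order."""
    return [a - p for a, p in zip(actual, predicted)]

def durbin_watson(errors):
    """Durbin-Watson statistic for a time-ordered list of residuals."""
    changes = sum((errors[t] - errors[t - 1]) ** 2 for t in range(1, len(errors)))
    total = sum(e ** 2 for e in errors)
    return changes / total

# The 1984 example from the text: predicted $5.54, actual $5.99.
print(round(residuals([5.99], [5.54])[0], 2))  # 0.45

# Toy residual series for illustration only.
print(durbin_watson([1, -1, 1, -1]))  # 3.0, errors flip sign every period
print(durbin_watson([1, 1, 1, 1]))    # 0.0, errors persist from period to period
```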

As for multicollinearity, the most common symptom is a variable’s regression coefficient having the wrong sign. We know that soybean crushings and soybean prices were negatively correlated when examined individually in the scatter plots. Therefore, the coefficient generated for crushings in our multiple regression model should also be negative. Because none of our variables’ coefficients has the wrong sign (and, intuitively, the variables shouldn’t be correlated anyway), we’ll assume that we don’t have the problem of multicollinearity.

BACK THROUGH THE PAST

To test our model on the out-of-sample data, we simply input the values for the independent variables in 2001 through 2005, but we use the regression coefficients that we calculated from the in-sample data set. “Without hindsight” (below) shows how close our model came to forecasting the average fall price of November soybeans using the original model values.
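The essential discipline is that the weights stay frozen: nothing is re-estimated with the held-back years. A sketch of that step follows, using the coefficients from our model but hypothetical inputs and prices, since the out-of-sample figures themselves appear only in the chart. The mean absolute error is our own choice of summary measure, not one the article reports.

```python
def predict(weights, row):
    """Apply frozen weights [intercept, b1, b2, ...] to one year's inputs."""
    intercept, *slopes = weights
    return intercept + sum(b * x for b, x in zip(slopes, row))

def mean_absolute_error(actual, predicted):
    """Average size of the forecast miss, ignoring direction."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Weights frozen from the in-sample fit reported in the article.
WEIGHTS = [13.082, -0.00720, 0.22446, 1.50606543]

# HYPOTHETICAL out-of-sample rows (crushings, corn ratio, El Nino dummy)
# and HYPOTHETICAL inflation-adjusted prices, for illustration only.
rows = [(1000, 5, 0), (1100, 4, 0)]
actual_prices = [7.50, 6.00]

forecasts = [predict(WEIGHTS, row) for row in rows]
print(round(mean_absolute_error(actual_prices, forecasts), 4))  # 0.2778
```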

Two points should be clear from this exercise in standard regression analysis. First, statistical analysis does not need to be complicated. With freely available government reports, spreadsheet software and a refresher on your Statistics 101, you should be able to build a model that provides a reasonable forecast of future price levels. Second, if you wish to make things more complicated, there certainly are ways to do that, from data selection to software choices to even variable manipulation.

Of course, complexity introduces risk. There’s a critical line that divides whether you’re going to risk being a little wrong with a simple model all of the time, or risk being wrong a lot with a complex model once. Each trader needs to draw that line himself.
