In the accompanying chart, soybean production is plotted against soybean yield for a 10-year sample. When the yield per acre is higher, so is total production. This is indicated in the chart by higher production on the y (vertical) axis corresponding to higher yield on the x (horizontal) axis.
Visually, it’s clear that this is a linear relationship. A relationship can be considered linear if the rate at which one variable changes with the other stays roughly constant across the range of values, rather than speeding up or slowing down at different variable values. Because this relationship is fairly “straight-line,” we can explain it with a regression equation.
While regression analysis assumes the dependent variable and the independent variable have a linear relationship, that does not mean all data points will fall exactly on the center line. Indeed, most, if not all, of the points will not be on the center line.
The vertical distance between a point and the line is the error value for that point. The standard error is a single number that summarizes these errors for the entire sample; the errors themselves should be random.
The technique that we’ll use for minimizing this error while still drawing a straight line is called least squares. As the name implies, the least-squares method results in a line that minimizes the sum of the squares of each point’s error value. We minimize the sum of the squared errors, rather than the sum of the errors themselves, because some of the raw error figures are negative. They would cancel out the positive errors when added, resulting in a summed error value of zero.
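The cancellation problem can be seen in a small Python sketch. The data points and both candidate lines below are hypothetical, chosen only for illustration. One line is flat and clearly ignores the trend, the other follows the points closely, yet the raw errors of both sum to zero. Only the sum of squared errors tells them apart.

```python
# Hypothetical sample, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]


def errors(xs, ys, intercept, slope):
    """Signed vertical distance of each point from the line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_of_squares(values):
    """Sum of the squared values."""
    return sum(v * v for v in values)


# Two candidate lines (both are assumptions for the demonstration).
flat_errors = errors(xs, ys, intercept=6.0, slope=0.0)    # ignores the trend
sloped_errors = errors(xs, ys, intercept=0.0, slope=2.0)  # follows the trend

# Raw errors cancel for both lines, so their sums cannot rank the lines.
print("raw sums:", round(sum(flat_errors), 6), round(sum(sloped_errors), 6))

# Squared errors cannot cancel, so the better line gets the smaller total.
sse_flat = sum_of_squares(flat_errors)
sse_sloped = sum_of_squares(sloped_errors)
print("squared sums:", round(sse_flat, 2), round(sse_sloped, 2))
```

Because every squared error is non-negative, a poor fit can no longer hide behind offsetting positive and negative errors.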
The steps for finding the equation of this line are relatively simple. The accompanying table demonstrates the initial calculations for finding the linear regression equation for a two-variable relationship.
First, we sum the x values and the y values and calculate the average of each. Then, we subtract each variable’s average from each of its values. In the fifth column, we multiply the difference between each x value and the x average by the corresponding difference for y. In the sixth column, we square the difference between each value of the independent variable (x) and its average (the third column). Finally, we sum the fifth and sixth columns and plug the two sums into the following equation to get the slope:
b = Σ[(x - x̄) * (y - ȳ)] / Σ[(x - x̄)^2]
Here x̄ and ȳ are the averages of x and y, the numerator is the sum of the fifth column, and the denominator is the sum of the sixth column.
Then, we insert b and the average values for the independent and dependent variables (x̄ and ȳ) into this equation to determine our intercept:
a = ȳ - b * x̄
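The column-by-column procedure can be sketched in Python as follows. The function name `fit_simple_regression` and the sample data are hypothetical. The article's table is not reproduced here, and the comment labeling the fourth column as the y differences is an assumption based on the description of the third, fifth, and sixth columns.

```python
def fit_simple_regression(xs, ys):
    """Return (a, b): intercept and slope of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    x_diff = [x - mean_x for x in xs]  # third column: x minus its average
    y_diff = [y - mean_y for y in ys]  # assumed fourth column: y minus its average
    products = [dx * dy for dx, dy in zip(x_diff, y_diff)]  # fifth column
    squares = [dx * dx for dx in x_diff]                    # sixth column

    b = sum(products) / sum(squares)  # slope
    a = mean_y - b * mean_x           # intercept
    return a, b


# Hypothetical data that lies exactly on the line y = 1 + 2x.
a, b = fit_simple_regression([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(a, b)
```

With data that sits exactly on a line, the function recovers that line's intercept and slope, which makes it easy to check the steps by hand.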
For our example, the calculations are:
b = 11,806.2 / 148.9
b = 79.3
and:
a = 2,124.5 - (79.3 * 35.1)
a = 2,124.5 - 2,783.4
a = -658.9
So, our regression equation is this:
y = -658.9 + 79.3 * x
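The arithmetic above can be reproduced with a few lines of Python. The four input figures are the ones quoted in the calculations. The variable names are illustrative, and the slope is rounded to one decimal place before the intercept is computed, mirroring the worked example.

```python
# Figures quoted in the worked example.
sum_of_products = 11806.2  # sum of the fifth column
sum_of_squares = 148.9     # sum of the sixth column
mean_x = 35.1              # average yield
mean_y = 2124.5            # average production

# Slope, rounded to one decimal place as in the worked example.
b = round(sum_of_products / sum_of_squares, 1)

# Intercept, computed from the rounded slope as in the worked example.
# Using the unrounded slope would give a slightly different intercept.
a = round(mean_y - b * mean_x, 1)

print(f"b = {b:.1f}")  # b = 79.3
print(f"a = {a:.1f}")  # a = -658.9
```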
This is the equation of the line in the chart. It approximates the relationship between yield per acre and total production. This tells us that for every one-bushel-per-acre increase in yield, we can expect total production (in thousands) to increase by 79,300 bushels.
This equation can be used to estimate the production level for a specific year if we already know the yield figure for that year.
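As a minimal sketch of that use, the regression equation can be wrapped in a small Python function. The function name `predict_production` and the yield of 40 bushels per acre are hypothetical, chosen only to show the calculation. The result is in the same units as the production figures in the chart.

```python
INTERCEPT = -658.9  # a, from the regression equation
SLOPE = 79.3        # b, from the regression equation


def predict_production(yield_per_acre):
    """Estimate total production from yield using y = a + b * x."""
    return INTERCEPT + SLOPE * yield_per_acre


# Hypothetical yield figure, not taken from the sample.
estimate = predict_production(40.0)
print(f"{estimate:.1f}")  # 2513.1
```

An estimate like this is most trustworthy for yields inside the range covered by the sample; the straight line has not been checked against data outside that range.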
However, one independent variable usually isn’t enough to describe a relationship. We often need two or more to create an equation that is sufficiently reliable. Finding an equation for a relationship with two or more independent variables generally requires a computer, but understanding the basic process provides a foundation for understanding multiple regression analysis.
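To give a sense of what the computer does in that case, here is a rough Python sketch that extends least squares to several independent variables by solving the so-called normal equations. This is one common approach, not necessarily the one any particular software package uses. The function names and the data are hypothetical, and the sample was constructed to lie exactly on y = 1 + 2 * x1 + 3 * x2 so the result is easy to check.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * unknowns = rhs by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]  # augmented copy
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from the rows below.
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitute from the last row up.
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution


def fit_multiple_regression(rows, ys):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared errors."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    # Normal equations: (X^T X) * coefficients = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data lying exactly on y = 1 + 2 * x1 + 3 * x2.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
ys = [9.0, 8.0, 19.0, 18.0, 26.0]
coefficients = fit_multiple_regression(rows, ys)
print([round(c, 6) for c in coefficients])
```

The sketch assumes the independent variables are not perfectly correlated with one another; if they are, the system has no unique solution and the division in the elimination step fails.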