Prototype 5

How Well Does the Model Fit the Data?

The basic idea of this concept is to make a prediction about the data (or about anything that can be turned into data) and then to compare that prediction with the scores actually obtained. You will see later how the terms model and fit apply to this concept. The mean can be used as a prediction. For example, you might be asked to guess how much Fred weighs. If that is all the information you have, your best guess would be the average weight of men. On the other hand, if you also knew how tall Fred was, your guess could be much improved. Such improvement is the focus of this section. The prototype will be the regression line. It is the basis of the general linear model.


To make this prediction we need a straight line that passes closest to all of the points. In Box G it is easy to find a line that would pass closest to all of the points. In fact the line can pass through all the points.

In Box H it is not as clear where to draw a line that would pass closest to all of the points.

Box I is similar in that one does not quite know where to draw a line that will be the closest to all of the points in the box.


One way to make the assessment would be to draw a line, measure the distance from each point to the line, and add up those distances, then draw a new line, make the measurements again, and repeat the procedure until one found the line that resulted in the smallest total. There is a mathematical way to find the solution, called the method of least squares, which finds the line that minimizes the sum of the squared vertical distances. The pairs of numbers can be plotted as points by having one set of measures plotted vertically (y axis) and the other set plotted horizontally (x axis). Two numbers are needed to identify where the line should be drawn: (1) the slope of the line and (2) where to begin the line.
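The trial-and-error procedure described above can be sketched in a few lines of Python (used here only for illustration; the course itself uses SPSS). The five data pairs are an assumption, not data taken from these notes, and the name sum_squared_distances is hypothetical. The sketch tries a grid of candidate lines and keeps the one with the smallest sum of squared vertical distances, which is the quantity the method of least squares minimizes.

```python
# Illustrative data pairs (an assumption, not the data set from the notes).
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 5, 4]


def sum_squared_distances(a, b):
    """Sum of squared vertical distances from each point to the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Try many candidate lines: intercepts and slopes from -2.0 to 2.0 in steps of 0.1.
best = None
for i in range(-20, 21):
    for j in range(-20, 21):
        a, b = i / 10, j / 10
        sse = sum_squared_distances(a, b)
        # Keep the candidate line with the smallest total so far.
        if best is None or sse < best[0]:
            best = (sse, a, b)

best_sse, best_a, best_b = best
print("intercept:", best_a, "slope:", best_b, "sum of squares:", round(best_sse, 2))
```

For these assumed data the search settles on an intercept of 0.3 and a slope of 0.9. The method of least squares reaches the same line directly, without any searching.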

The slope of the line (for predicting y when x is known) is determined as:


The convention in statistics is that x variables are predictors and y variables are the criterion or predicted variables. We will use that convention.

The second number that is needed is where to start the line: the intercept, or the value of y when x is 0. It is the mean of y minus the slope times the mean of x. The formula is:

a = (mean of Y) - (b times the mean of X).
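As a rough illustration of the two computations, the sketch below uses one common textbook form of the least-squares slope: the sum of the cross-products of the deviations divided by the sum of the squared deviations of x. That form is an assumption here, since the notes may state the slope in a different but equivalent way. The data pairs and the function name least_squares_line are also hypothetical.

```python
from statistics import mean

# Illustrative data pairs (an assumption, not the data set from the notes).
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 5, 4]


def least_squares_line(xs, ys):
    """Return (a, b): the intercept and slope for predicting y from x."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Sum of cross-products of deviations from the two means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x from its mean.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept: mean of y minus slope times mean of x
    return a, b


a, b = least_squares_line(xs, ys)
print(round(a, 2), round(b, 2))  # prints 0.3 0.9
```

The assumed data were chosen so that the result agrees with the intercept of .3 and the slope of .9 used later in the section.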

Using the results of these two formulae we can now plot the regression line. In order to keep us connected to the task of learning to use the computer and SPSS, the graph is generated from the SPSS package. The following set of data will be used in this example (you have seen it before).

This regression can now be plotted as the line that comes closest to the points of the scatterplot. The SPSS program will plot everything but the regression line, as seen in the following Figure. The following syntax file will produce a plot that includes everything but the regression line, which has been drawn in for our purposes.


Plots of the data might be helpful in representing Prototype # 5. You can get those in a crude form from the SPSS program (not that SPSS is crude). The following is a syntax file that will generate the plot needed:

The following is the plot produced.

The next plot is the same plot with further explanation of the data points added.


The next plot has been further modified to show the regression line as computed above. The regression line was drawn by starting at .3 on the Y axis when X was equal to 0 and incrementing .9 on the Y axis for each increment of 1 on the X axis. The formula used to generate the regression line was:

Y' = Y primed = a + (b times X).
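The way the line is drawn can be checked with a minimal sketch of the formula, using the intercept (.3) and slope (.9) given above. The function name predict is hypothetical.

```python
def predict(x, a=0.3, b=0.9):
    """Return Y' = a + (b times X) for the regression line described in the text."""
    return a + b * x


# Start at .3 when X is 0 and go up .9 for each increase of 1 in X.
for x in range(0, 6):
    print(x, round(predict(x), 1))
```

Each step of 1 along the X axis raises the predicted value by .9, which is exactly how the line in the plot was drawn by hand.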

The model is obtained in the following manner: (1) Find a straight line which passes closest to all of the points of the variables when they are plotted on the x and y axes. (2) Use this line to predict y scores from the x scores. (3) The difference between the predicted score and the actual score is the error. (4) Square each error score and sum the squares. (5) Compare the sum of squares error to the total sum of squares. The comparison indicates the strength of the relationship between the variables, or the fit. There are no new computations here; it has all been done in the above example. Only the concept is added. The correlation itself indicates the fit. This is another way to conceptualize the relationship. It becomes useful in the conceptualization of complex multivariate statistics.
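The five steps can be traced in a short sketch. The data pairs are again an assumption rather than the data set from these notes, and the variable names are hypothetical. The comparison in step (5) is expressed here as 1 minus the ratio of the sum of squares error to the total sum of squares, which is one common way to state the fit. For a single predictor it equals the squared correlation.

```python
from statistics import mean

# Illustrative data pairs (an assumption, not the data set from the notes).
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 5, 4]

mean_x, mean_y = mean(xs), mean(ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)

# (1) The line that passes closest to all of the points.
b = sxy / sxx
a = mean_y - b * mean_x

# (2) Predict y scores from the x scores.
predicted = [a + b * x for x in xs]

# (3) Error = actual score minus predicted score.
errors = [y - p for y, p in zip(ys, predicted)]

# (4) Square each error and sum the squares.
ss_error = sum(e ** 2 for e in errors)

# (5) Compare the sum of squares error to the total sum of squares.
ss_total = sum((y - mean_y) ** 2 for y in ys)
fit = 1 - ss_error / ss_total

# The correlation, computed separately, for comparison with the fit.
r = sxy / (sxx * ss_total) ** 0.5

print("SS error:", round(ss_error, 2))
print("SS total:", round(ss_total, 2))
print("fit:", round(fit, 2), "r squared:", round(r ** 2, 2))
```

With these assumed data the fit and the squared correlation agree, which illustrates the statement that the correlation itself indicates the fit.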


The differences between the predicted and observed values (the lines drawn from the regression line to the observed values) are the errors in prediction: the degree to which the model does not fit the data. The error variance is based on the sum of the squares of the lengths of these lines.

The regression line is the line that will come closest to all of the observed values. If the squared lengths of the lines drawn from the regression line to the observed values were added together, the total would be smaller than for any other line that could be drawn through the observed values. This graph represents Prototype # 5. The regression line is the prediction (or model), and the lines from the regression line to the actual data points are the error in prediction. This represents the fit of the model to the data.

The regression line can be generated in SPSS in the following manner:
Click on Graphs
Click on Scatter
Click on Simple
Click on Define
Select X variable
Click on the Delta Button to move the variable into the X-axis box
Select Y variable
Click on the Delta Button to move the variable into the Y-axis box
Click OK
Double Click on the chart itself
Click Chart
Click Options
Click Total
Click OK