Regression, Prediction, and Model Building
© 1998 by Dr. Thomas W. MacFarland -- All Rights Reserved
************
reg_sion.doc
************
Background: Statistical tests are used to carefully examine
prior activities and to then use these analyses
to make informed predictions about future
activities. Regardless of the statistical test,
data are examined in a systematic manner so that
decisions can be made with some degree of certainty.
It is very common to use existing data to offer a
prediction of the future. Using existing data to
predict future outcomes is called model building.
That is to say, existing data are used to build a
model of the future, with an estimated degree of
error built into the model.
Multiple regression is a common and useful tool
for model building.
Scenario: This study will demonstrate how historical data on
Math and Verbal SAT scores can be used to predict
University GPA. That is to say:
-- Provided that you know a student's Math
SAT score and Verbal SAT score,
-- Can you use these two scores to predict
this student's University GPA?
This study will attempt to resolve the following
equation:
University GPA = Constant +or- (x * Math_SAT)
+or- (y * Verb_SAT)
Data are from the 105 students who graduated
from a local state university, earning a B.S.
in Computer Science. Data on these students
were previously identified in the tutorial on
the use of Pearson's Product-Moment Coefficient
of Correlation.
Because the data are all interval data (i.e.,
the data are parametric, with the difference
between a 3.87 GPA and a 3.88 GPA equal to the
difference between a 4.03 GPA and a 4.04 GPA),
Pearson's Coefficient of Correlation is the
correct test to determine the degree of
association between these variables.
Note: The data file 'reg_sion.dat' is an
exact copy of the file used to conduct the
Pearson's Coefficient of Correlation analysis.
Data for this study are summarized in Table 1.
Table 1
Summary Statistics of Computer Science Graduates:
High School Grade Point Average (High_GPA), Math
Scholastic Aptitude Test Score (Math_SAT), Verbal
Scholastic Aptitude Test Score (Verb_SAT), Computer
Science Grade Point Average (Comp_GPA), and Overall
University Grade Point Average (Univ_GPA)
========================================================
Student
Number High_GPA Math_SAT Verb_SAT Comp_GPA Univ_GPA
--------------------------------------------------------
001 3.45 643 589 3.76 3.52
002 2.78 558 512 2.87 2.91
003 2.52 583 503 2.54 2.40
004 3.67 685 602 3.83 3.47
005 3.24 592 538 3.29 3.47
006 2.10 562 486 2.64 2.37
007 2.82 573 548 2.86 2.40
008 2.36 559 536 2.03 2.24
009 2.42 552 583 2.81 3.02
010 3.51 617 591 3.41 3.32
011 3.48 684 649 3.61 3.59
012 2.14 568 592 2.48 2.54
013 2.59 604 582 3.21 3.19
014 3.46 619 624 3.52 3.71
015 3.51 642 619 3.41 3.58
016 3.68 683 642 3.52 3.40
017 3.91 703 684 3.84 3.73
018 3.72 712 652 3.64 3.49
019 2.15 564 501 2.14 2.25
020 2.48 557 549 2.21 2.37
021 3.09 591 584 3.17 3.29
022 2.71 599 562 3.01 3.19
023 2.46 607 619 3.17 3.28
024 3.32 619 558 3.01 3.37
025 3.61 700 721 3.72 3.61
026 3.82 718 732 3.78 3.81
027 2.64 580 538 2.51 2.40
028 2.19 562 507 2.10 2.21
029 3.34 683 648 3.21 3.58
030 3.48 717 724 3.68 3.51
031 3.56 701 714 3.48 3.62
032 3.81 691 684 3.71 3.60
033 3.92 714 706 3.81 3.65
034 4.00 689 673 3.84 3.76
035 2.52 554 507 2.09 2.27
036 2.71 564 543 2.17 2.35
037 3.15 668 604 2.98 3.17
038 3.22 691 662 3.28 3.47
039 2.29 573 591 2.74 3.00
040 2.03 568 517 2.19 2.74
041 3.14 607 624 3.28 3.37
042 3.52 651 683 3.68 3.54
043 2.91 604 583 3.17 3.28
044 2.83 560 542 3.17 3.39
045 2.65 604 617 3.31 3.28
046 2.41 574 548 3.07 3.19
047 2.54 564 500 2.38 2.52
048 2.66 607 528 2.94 3.08
049 3.21 619 573 2.84 3.01
050 3.34 647 608 3.17 3.42
051 3.68 651 683 3.72 3.60
052 2.84 571 543 2.17 2.40
053 2.74 583 510 2.42 2.83
054 2.71 554 538 2.49 2.38
055 2.24 568 519 3.38 3.21
056 2.48 574 602 2.07 2.24
057 3.14 605 619 3.22 3.40
058 2.83 591 584 2.71 3.07
059 3.44 642 608 3.31 3.52
060 2.89 608 573 3.28 3.47
061 2.67 574 538 3.19 3.08
062 3.24 643 607 3.24 3.38
063 3.29 608 649 3.53 3.41
064 3.87 709 688 3.72 3.64
065 3.94 691 645 3.98 3.71
066 3.42 667 583 3.09 3.01
067 3.52 656 609 3.42 3.37
068 2.24 554 542 2.07 2.34
069 3.29 692 563 3.17 3.29
070 3.41 684 672 3.51 3.40
071 3.56 717 649 3.49 3.38
072 3.61 712 708 3.51 3.28
073 3.28 641 608 3.40 3.31
074 3.21 675 632 3.38 3.42
075 3.48 692 698 3.54 3.39
076 3.62 684 609 3.48 3.51
077 2.92 564 591 3.09 3.17
078 2.81 554 509 3.14 3.20
079 3.11 685 694 3.28 3.41
080 3.28 671 609 3.41 3.29
081 2.70 571 503 3.02 3.17
082 2.62 582 591 2.97 3.12
083 3.72 621 589 4.00 3.71
084 3.42 651 642 3.34 3.50
085 3.51 673 681 3.28 3.34
086 3.28 651 640 3.32 3.48
087 3.42 672 607 3.51 3.44
088 3.90 591 587 3.68 3.59
089 3.12 582 612 3.07 3.28
090 2.83 609 555 2.78 3.00
091 2.09 554 480 3.68 3.42
092 3.17 612 590 3.30 3.41
093 3.28 628 580 3.34 3.49
094 3.02 567 602 3.17 3.28
095 3.42 619 623 3.07 3.17
096 3.06 691 683 3.19 3.24
097 2.76 564 549 2.15 2.34
098 3.19 650 684 3.11 3.28
099 2.23 551 554 2.17 2.29
100 2.48 568 541 2.14 2.08
101 3.76 605 590 3.74 3.64
102 3.49 692 683 3.27 3.42
103 3.07 680 692 3.19 3.25
104 2.19 617 503 2.98 2.76
105 3.46 516 528 3.28 3.41
--------------------------------------------------------
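As a minimal sketch, the records in Table 1 can be read into a
program before any analysis is attempted. The Python code below
assumes the six fields are separated by whitespace (the SPSS
command file shown later reads fixed columns instead). The names
Student, parse_record, and parse_records are illustrative only
and are not part of SPSS or MINITAB.

```python
from typing import NamedTuple


class Student(NamedTuple):
    """One row of Table 1 (field names follow the table header)."""
    stu_code: str
    high_gpa: float
    math_sat: int
    verb_sat: int
    comp_gpa: float
    univ_gpa: float


def parse_record(line: str) -> Student:
    """Split one whitespace-separated record into typed fields."""
    code, high, math, verb, comp, univ = line.split()
    return Student(code, float(high), int(math), int(verb),
                   float(comp), float(univ))


def parse_records(text: str) -> list[Student]:
    """Parse every non-blank line of the data text."""
    return [parse_record(line) for line in text.splitlines() if line.strip()]


# The first three rows of Table 1, copied as they appear in the table.
sample = """\
001 3.45 643 589 3.76 3.52
002 2.78 558 512 2.87 2.91
003 2.52 583 503 2.54 2.40
"""
students = parse_records(sample)
print(students[0].math_sat, students[0].univ_gpa)
```

To read the full data set, the same function could be given the
contents of 'reg_sion.dat', provided the file is laid out as
assumed here.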
Files: 1. reg_sion.doc
2. reg_sion.dat
3. reg_sion.r01
4. reg_sion.o01
5. reg_sion.con
6. reg_sion.lis
Command: At the Unix prompt (%), key:
% spss -m < reg_sion.r01 > reg_sion.o01
************
reg_sion.dat
************
001 3.45 643 589 3.76 3.52
002 2.78 558 512 2.87 2.91
003 2.52 583 503 2.54 2.40
004 3.67 685 602 3.83 3.47
005 3.24 592 538 3.29 3.47
006 2.10 562 486 2.64 2.37
007 2.82 573 548 2.86 2.40
008 2.36 559 536 2.03 2.24
009 2.42 552 583 2.81 3.02
010 3.51 617 591 3.41 3.32
011 3.48 684 649 3.61 3.59
012 2.14 568 592 2.48 2.54
013 2.59 604 582 3.21 3.19
014 3.46 619 624 3.52 3.71
015 3.51 642 619 3.41 3.58
016 3.68 683 642 3.52 3.40
017 3.91 703 684 3.84 3.73
018 3.72 712 652 3.64 3.49
019 2.15 564 501 2.14 2.25
020 2.48 557 549 2.21 2.37
021 3.09 591 584 3.17 3.29
022 2.71 599 562 3.01 3.19
023 2.46 607 619 3.17 3.28
024 3.32 619 558 3.01 3.37
025 3.61 700 721 3.72 3.61
026 3.82 718 732 3.78 3.81
027 2.64 580 538 2.51 2.40
028 2.19 562 507 2.10 2.21
029 3.34 683 648 3.21 3.58
030 3.48 717 724 3.68 3.51
031 3.56 701 714 3.48 3.62
032 3.81 691 684 3.71 3.60
033 3.92 714 706 3.81 3.65
034 4.00 689 673 3.84 3.76
035 2.52 554 507 2.09 2.27
036 2.71 564 543 2.17 2.35
037 3.15 668 604 2.98 3.17
038 3.22 691 662 3.28 3.47
039 2.29 573 591 2.74 3.00
040 2.03 568 517 2.19 2.74
041 3.14 607 624 3.28 3.37
042 3.52 651 683 3.68 3.54
043 2.91 604 583 3.17 3.28
044 2.83 560 542 3.17 3.39
045 2.65 604 617 3.31 3.28
046 2.41 574 548 3.07 3.19
047 2.54 564 500 2.38 2.52
048 2.66 607 528 2.94 3.08
049 3.21 619 573 2.84 3.01
050 3.34 647 608 3.17 3.42
051 3.68 651 683 3.72 3.60
052 2.84 571 543 2.17 2.40
053 2.74 583 510 2.42 2.83
054 2.71 554 538 2.49 2.38
055 2.24 568 519 3.38 3.21
056 2.48 574 602 2.07 2.24
057 3.14 605 619 3.22 3.40
058 2.83 591 584 2.71 3.07
059 3.44 642 608 3.31 3.52
060 2.89 608 573 3.28 3.47
061 2.67 574 538 3.19 3.08
062 3.24 643 607 3.24 3.38
063 3.29 608 649 3.53 3.41
064 3.87 709 688 3.72 3.64
065 3.94 691 645 3.98 3.71
066 3.42 667 583 3.09 3.01
067 3.52 656 609 3.42 3.37
068 2.24 554 542 2.07 2.34
069 3.29 692 563 3.17 3.29
070 3.41 684 672 3.51 3.40
071 3.56 717 649 3.49 3.38
072 3.61 712 708 3.51 3.28
073 3.28 641 608 3.40 3.31
074 3.21 675 632 3.38 3.42
075 3.48 692 698 3.54 3.39
076 3.62 684 609 3.48 3.51
077 2.92 564 591 3.09 3.17
078 2.81 554 509 3.14 3.20
079 3.11 685 694 3.28 3.41
080 3.28 671 609 3.41 3.29
081 2.70 571 503 3.02 3.17
082 2.62 582 591 2.97 3.12
083 3.72 621 589 4.00 3.71
084 3.42 651 642 3.34 3.50
085 3.51 673 681 3.28 3.34
086 3.28 651 640 3.32 3.48
087 3.42 672 607 3.51 3.44
088 3.90 591 587 3.68 3.59
089 3.12 582 612 3.07 3.28
090 2.83 609 555 2.78 3.00
091 2.09 554 480 3.68 3.42
092 3.17 612 590 3.30 3.41
093 3.28 628 580 3.34 3.49
094 3.02 567 602 3.17 3.28
095 3.42 619 623 3.07 3.17
096 3.06 691 683 3.19 3.24
097 2.76 564 549 2.15 2.34
098 3.19 650 684 3.11 3.28
099 2.23 551 554 2.17 2.29
100 2.48 568 541 2.14 2.08
101 3.76 605 590 3.74 3.64
102 3.49 692 683 3.27 3.42
103 3.07 680 692 3.19 3.25
104 2.19 617 503 2.98 2.76
105 3.46 516 528 3.28 3.41
************
reg_sion.r01
************
SET WIDTH = 80
SET LENGTH = NONE
SET CASE = UPLOW
SET HEADER = NO
TITLE = Multiple Regression
COMMENT = This file is used to build a regression model
for University Grade Point Average and SAT
scores. That is to say:
-- Provided that you know a student's Math
SAT score and Verbal SAT score,
-- Can you use these two scores to predict
this student's University GPA?
Data are from the 105 students who graduated
from a local state university, earning a B.S.
in Computer Science. As entering freshmen,
these students need a Math SAT score of 550
or greater to major in Computer Science.
Because the data are all interval data (i.e.,
the data are parametric, with the difference
between a 3.87 GPA and a 3.88 GPA equal to the
difference between a 4.03 GPA and a 4.04 GPA),
Pearson's Coefficient of Correlation is the
correct test to determine the degree of
association between these variables.
Note: The data file 'reg_sion.dat' is an
exact copy of the file used to conduct the
Pearson's Coefficient of Correlation analysis.
DATA LIST FILE = 'reg_sion.dat' FIXED
/ Stu_Code 12-14
High_GPA 22-25
Math_SAT 32-34
Verb_SAT 42-44
Comp_GPA 52-55
Univ_GPA 62-65
FORMAT High_GPA (F4.2)
FORMAT Comp_GPA (F4.2)
FORMAT Univ_GPA (F4.2)
COMMENT = By using the "FORMAT" command in this way,
the three GPA scores are restricted to
four columns, with the last two columns to
the right of the decimal point.
Variable Labels
Stu_Code "Student Code"
/ High_GPA "High School GPA"
/ Math_SAT "Mathematics SAT Score"
/ Verb_SAT "Verbal SAT Score"
/ Comp_GPA "GPA in Computer Science Courses"
/ Univ_GPA "GPA in All University Courses"
REGRESSION VARIABLES = Univ_GPA Math_SAT Verb_SAT
/ DEPENDENT = Univ_GPA
/ METHOD = ENTER
COMMENT = Notice how Univ_GPA is declared as the
dependent variable.
************
reg_sion.o01
************
1 SET WIDTH = 80
2 SET LENGTH = NONE
3 SET CASE = UPLOW
4 SET HEADER = NO
5 TITLE = Multiple Regression
6 COMMENT = This file is used to build a regression model
7 for University Grade Point Average and SAT
8 scores. That is to say:
9
10 -- Provided that you know a student's Math
11 SAT score and Verbal SAT score,
12
13 -- Can you use these two scores to predict
14 this student's University GPA?
15
16 Data are from the 105 students who graduated
17 from a local state university, earning a B.S.
18 in Computer Science. As entering freshmen,
19 these students need a Math SAT score of 550
20 or greater to major in Computer Science.
21
22 Because the data are all interval data (i.e.,
23 the data are parametric, with the difference
24 between a 3.87 GPA and a 3.88 GPA equal to the
25 difference between a 4.03 GPA and a 4.04 GPA),
26 Pearson's Coefficient of Correlation is the
27 correct test to determine the degree of
28 association between these variables.
29
30 Note: The data file 'reg_sion.dat' is an
31 exact copy of the file used to conduct the
32 Pearson's Coefficient of Correlation analysis.
33 DATA LIST FILE = 'reg_sion.dat' FIXED
34 / Stu_Code 12-14
35 High_GPA 22-25
36 Math_SAT 32-34
37 Verb_SAT 42-44
38 Comp_GPA 52-55
39 Univ_GPA 62-65
40
This command will read 1 records from reg_sion.dat
Variable Rec Start End Format
STU_CODE 1 12 14 F3.0
HIGH_GPA 1 22 25 F4.0
MATH_SAT 1 32 34 F3.0
VERB_SAT 1 42 44 F3.0
COMP_GPA 1 52 55 F4.0
UNIV_GPA 1 62 65 F4.0
41 FORMAT High_GPA (F4.2)
42 FORMAT Comp_GPA (F4.2)
43 FORMAT Univ_GPA (F4.2)
44
45 COMMENT = By using the "FORMAT" command in this way,
46 the three GPA scores are restricted to
47 four columns, with the last two columns to
48 the right of the decimal point.
49
50 Variable Labels
51 Stu_Code "Student Code"
52 / High_GPA "High School GPA"
53 / Math_SAT "Mathematics SAT Score"
54 / Verb_SAT "Verbal SAT Score"
55 / Comp_GPA "GPA in Computer Science Courses"
56 / Univ_GPA "GPA in All University Courses"
57
58 REGRESSION VARIABLES = Univ_GPA Math_SAT Verb_SAT
59 / DEPENDENT = Univ_GPA
60 / METHOD = ENTER
61
62 COMMENT = Notice how Univ_GPA is declared as the
63 dependent variable.
1404 bytes of memory required for REGRESSION procedure.
0 more bytes may be needed for Residuals plots.
* * * *  M U L T I P L E   R E G R E S S I O N  * * * *
Listwise Deletion of Missing Data
Equation Number 1   Dependent Variable..  UNIV_GPA  GPA in All University Cou
Block Number 1. Method: Enter
Variable(s) Entered on Step Number
1.. VERB_SAT Verbal SAT Score
2.. MATH_SAT Mathematics SAT Score
Multiple R .68573
R Square .47022
Adjusted R Square .45983
Standard Error .32867
Analysis of Variance
DF Sum of Squares Mean Square
Regression 2 9.77974 4.88987
Residual 102 11.01840 .10802
F = 45.26669 Signif F = .0000
------------------ Variables in the Equation ------------------
Variable B SE B Beta T Sig T
MATH_SAT .003291 .001090 .395622 3.019 .0032
VERB_SAT .002272 9.3082E-04 .319867 2.441 .0164
(Constant) -.237534 .375038 -.633 .5279
End Block Number 1 All requested variables entered.
************
reg_sion.con
************
Outcome: The following information from the SPSS output
file is used to develop the model, or the
prediction equation:
------------------ Variables in the Equation ------------------
Variable B SE B Beta T Sig T
MATH_SAT .003291 .001090 .395622 3.019 .0032
VERB_SAT .002272 9.3082E-04 .319867 2.441 .0164
(Constant) -.237534 .375038 -.633 .5279
Although there is an abundance of information in
this part of the SPSS printout, the following
parts of the printout are what you need to build
the prediction equation:
Univ_GPA = Constant +or- (x * Math_SAT)
+or- (y * Verb_SAT)
Univ_GPA = -.237534 + (.003291 * Math_SAT)
+ (.002272 * Verb_SAT)
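The B values in the printout are least-squares estimates: the
constant and the two weights are chosen so that the sum of
squared differences between observed and predicted Univ_GPA is
as small as possible. The Python code below is a minimal sketch
of that calculation for two predictors, solving the normal
equations by elimination. It is not the routine SPSS or MINITAB
uses internally, the function name fit_two_predictors is
illustrative only, and the toy numbers are made up for the demo
(they are not from this study).

```python
def fit_two_predictors(x1, x2, y):
    """Return (constant, b1, b2) minimizing the sum of squared errors."""
    n = len(y)
    # Columns of the design matrix: intercept, first and second predictor.
    cols = [[1.0] * n, [float(v) for v in x1], [float(v) for v in x2]]
    # Augmented normal equations: (X'X | X'y), a 3 x 4 matrix.
    a = []
    for ci in cols:
        row = [sum(p * q for p, q in zip(ci, cj)) for cj in cols]
        row.append(sum(p * q for p, q in zip(ci, y)))
        a.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        if abs(a[i][i]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        for r in range(3):
            if r != i:
                factor = a[r][i] / a[i][i]
                a[r] = [vr - factor * vi for vr, vi in zip(a[r], a[i])]
    return tuple(a[i][3] / a[i][i] for i in range(3))


# Toy data (made up): y is exactly 1 + 2*x1 + 3*x2, so the fit
# should recover those three numbers.
toy_x1 = [1, 2, 3, 4, 5]
toy_x2 = [2, 1, 4, 3, 6]
toy_y = [1 + 2 * u + 3 * v for u, v in zip(toy_x1, toy_x2)]
constant, b1, b2 = fit_two_predictors(toy_x1, toy_x2, toy_y)
print(constant, b1, b2)
```

Given the 105 Math_SAT, Verb_SAT, and Univ_GPA values from
'reg_sion.dat', the same function would be expected to return
values close to the Constant and B entries in the printout.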
It is always best to try a sample calculation to
see if the model is reasonable. Imagine a student
with a Math_SAT of 650 and a Verb_SAT of
625. Using the prediction formula for this study:
Univ_GPA = -.237534 + (.003291 * Math_SAT)
+ (.002272 * Verb_SAT)
Univ_GPA = -.237534 + (.003291 * 650) + (.002272 * 625)
Univ_GPA = -.237534 + (2.13915) + (1.42)
Univ_GPA = 3.32162
And it is certainly reasonable to think that a
student with a Math_SAT score of 650 and a Verb_SAT
score of 625 would graduate from university with a
GPA of approximately 3.32 (GPA = 4.0 is all A's).
Again, always try a sample calculation to verify that
the model gives reasonable predictions.
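The sample calculation can also be scripted, which makes it easy
to try other score combinations. The Python code below is a
minimal sketch that uses the Constant and B values from the SPSS
printout. The function name predict_univ_gpa is illustrative only.

```python
# Coefficients copied from the "Variables in the Equation" printout.
CONSTANT = -0.237534
B_MATH_SAT = 0.003291
B_VERB_SAT = 0.002272


def predict_univ_gpa(math_sat, verb_sat):
    """Apply the prediction equation to one pair of SAT scores."""
    return CONSTANT + (B_MATH_SAT * math_sat) + (B_VERB_SAT * verb_sat)


# Same inputs as the worked example: Math_SAT = 650, Verb_SAT = 625.
predicted = predict_univ_gpa(650, 625)
print(round(predicted, 5))  # prints 3.32162
```

Keep in mind that the model was built from Math_SAT scores of
roughly 516 to 718 and Verb_SAT scores of roughly 480 to 732
(see Table 1). Predictions for scores well outside these ranges
should be treated with caution.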
As you will notice, the prediction equation is much
easier to read in the attached MINITAB printout.
************
reg_sion.lis
************
% minitab
MTB > outfile 'reg_sion.lis'
Collecting Minitab session in file: reg_sion.lis
MTB > # MINITAB addendum to 'reg_sion.dat'
MTB > #
MTB > read 'reg_sion.dat' c1 c2 c3 c4 c5 c6
Entering data from file: reg_sion.dat
105 rows read.
MTB > name c1 'Stu_Code'
MTB > name c2 'High_GPA'
MTB > name c3 'Math_SAT'
MTB > name c4 'Verb_SAT'
MTB > name c5 'Comp_GPA'
MTB > name c6 'Univ_GPA'
MTB > #
MTB > # Before I conduct the regression analysis, I like to
MTB > # plot the two predictor variables.
MTB > #
MTB > plot 'Math_SAT' 'Verb_SAT'
[Character plot: Math_SAT (vertical axis, ticks at 560, 630, and 700)
versus Verb_SAT (horizontal axis, ticks from 500 to 750 in steps of 50)]
MTB > # And as you see, there is a generally positive association
MTB > # between Verb_SAT and Math_SAT.
MTB > #
MTB > regress 'Univ_GPA' on 2 predictor variables 'Math_SAT' 'Verb_SAT'
The regression equation is
Univ_GPA = - 0.238 + 0.00329 Math_SAT + 0.00227 Verb_SAT
Predictor Coef Stdev t-ratio p
Constant -0.2375 0.3750 -0.63 0.528
Math_SAT 0.003291 0.001090 3.02 0.003
Verb_SAT 0.0022718 0.0009308 2.44 0.016
s = 0.3287 R-sq = 47.0% R-sq(adj) = 46.0%
Analysis of Variance
SOURCE DF SS MS F p
Regression 2 9.7797 4.8899 45.27 0.000
Error 102 11.0184 0.1080
Total 104 20.7981
SOURCE DF SEQ SS
Math_SAT 1 9.1363
Verb_SAT 1 0.6435
Continue? y
MTB > # Unlike SPSS, MINITAB actually prints out the regression
MTB > # formula:
MTB > #
MTB > # Univ_GPA = - 0.238 + 0.00329 Math_SAT + 0.00227 Verb_SAT
MTB > #
MTB > # I will test this formula, using 650 on Math SAT and 589
MTB > # for the Verbal SAT score:
MTB > #
MTB > # Univ_GPA = -0.238 + (0.00329 * 650) + (0.00227 * 589)
MTB > # Univ_GPA = -0.238 + (2.1385) + (1.33703)
MTB > # Univ_GPA = 3.23753
MTB > #
MTB > # And it is perfectly reasonable to expect a student with
MTB > # a Math SAT score of 650 and a Verbal SAT score of 589
MTB > # to later achieve a University Grade Point Average of
MTB > # approximately 3.24.
MTB > #
MTB > # As a note, you may want to look into the issue of
MTB > # multicollinearity when determining which predictor
MTB > # variables to select for the regression model. But
MTB > # this topic is beyond the scope of this tutorial.
MTB > stop
--------------------------
Disclaimer: All care was used to prepare the information in this
tutorial. Even so, the author does not and cannot guarantee the
accuracy of this information. The author disclaims any and all
injury that may come about from the use of this tutorial. As
always, students and all others should check with their advisor(s)
and/or other appropriate professionals for any and all assistance
on research design, analysis, selected levels of significance, and
interpretation of output file(s).
The author is entitled to exclusive distribution of this tutorial.
Readers have permission to print this tutorial for individual use,
provided that the copyright statement appears and that there is no
redistribution of this tutorial without permission.
Prepared 980316
Revised 980914
end-of-file 'reg_sion.ssi'