Regression, Prediction, and Model Building

© 1998 by Dr. Thomas W. MacFarland -- All Rights Reserved


************
reg_sion.doc
************
Background:  Statistical tests are used to examine prior
             activities carefully, and the resulting analyses
             are then used to make informed predictions about
             future activities.  Regardless of the statistical
             test, data are examined in a systematic manner so
             that decisions can be made with some degree of
             certainty.


             It is very common to use accepted data to offer a
             prediction of the future.  Using existing data
             to predict future outcomes is known as model
             building.  That is to say,
             existing data are used to build a model of the
             future, with a predetermined degree of error
             built into the model.  

             Multiple regression is a common and useful tool
             for model building.


Scenario:    This study will demonstrate how historical data on
             Math and Verbal SAT scores can be used to predict 
             University GPA. That is to say:

                 -- Provided that you know a student's Math
                    SAT score and Verbal SAT score,

                 -- Can you use these two scores to predict
                    this student's University GPA?

             This study will attempt to estimate the constant
             and the coefficients x and y in the following
             equation:

             University GPA = Constant +or- (x * Math_SAT)
                                       +or- (y * Verb_SAT)

             Data are from the 105 students who graduated 
             from a local state university, earning a B.S. 
             in Computer Science.  Data on these students
             were previously identified in the tutorial on 
             the use of Pearson's Product-Moment Coefficient  
             of Correlation.  

             Because the data are all interval data (i.e., 
             the data are parametric, with the difference 
             between a 3.87 GPA and a 3.88 GPA equal to the 
             difference between a 4.03 GPA and a 4.04 GPA),
             Pearson's Coefficient of Correlation is the 
             correct test to determine the degree of    
             association between these variables.

             Note:  The data file 'reg_sion.dat' is an
             exact copy of the file used to conduct the
             Pearson's Coefficient of Correlation analysis.
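
             As a small illustration of what Pearson's
             coefficient measures, the sketch below computes r
             between Math_SAT and Univ_GPA for only the first
             five students listed in Table 1.  Python is used
             here purely for illustration (the tutorial itself
             uses SPSS and MINITAB), the function name
             pearson_r is hypothetical, and a five-student
             subset will not reproduce the value for all 105
             students.

```python
import math

# First five rows of Table 1 as (Math_SAT, Univ_GPA) pairs.
# This is an illustrative subset, not the full data set.
pairs = [(643, 3.52), (558, 2.91), (583, 2.40), (685, 3.47), (592, 3.47)]


def pearson_r(points):
    """Pearson product-moment correlation for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of cross-products and sums of squares of the deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in points)
    ss_x = sum((x - mean_x) ** 2 for x, _ in points)
    ss_y = sum((y - mean_y) ** 2 for _, y in points)
    return cross / math.sqrt(ss_x * ss_y)


r = pearson_r(pairs)
print(f"r = {r:.3f}")
```

             A positive r indicates that, in this subset,
             higher Math_SAT scores tend to go with higher
             Univ_GPA values.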

             Data for this study are summarized in Table 1.


           Table 1

           Summary Statistics of Computer Science Graduates:
           High School Grade Point Average (High_GPA), Math 
           Scholastic Aptitude Test Score (Math_SAT), Verbal 
           Scholastic Aptitude Test Score (Verb_SAT), Computer 
           Science Grade Point Average (Comp_GPA), and Overall
           University Grade Point Average (Univ_GPA)

           ======================================================== 

           Student 
           Number  High_GPA  Math_SAT  Verb_SAT  Comp_GPA Univ_GPA
           --------------------------------------------------------    

           001       3.45      643       589       3.76      3.52
           002       2.78      558       512       2.87      2.91
           003       2.52      583       503       2.54      2.40
           004       3.67      685       602       3.83      3.47
           005       3.24      592       538       3.29      3.47
           006       2.10      562       486       2.64      2.37
           007       2.82      573       548       2.86      2.40
           008       2.36      559       536       2.03      2.24
           009       2.42      552       583       2.81      3.02
           010       3.51      617       591       3.41      3.32
           011       3.48      684       649       3.61      3.59
           012       2.14      568       592       2.48      2.54
           013       2.59      604       582       3.21      3.19
           014       3.46      619       624       3.52      3.71
           015       3.51      642       619       3.41      3.58
           016       3.68      683       642       3.52      3.40
           017       3.91      703       684       3.84      3.73
           018       3.72      712       652       3.64      3.49
           019       2.15      564       501       2.14      2.25
           020       2.48      557       549       2.21      2.37
           021       3.09      591       584       3.17      3.29
           022       2.71      599       562       3.01      3.19
           023       2.46      607       619       3.17      3.28
           024       3.32      619       558       3.01      3.37
           025       3.61      700       721       3.72      3.61
           026       3.82      718       732       3.78      3.81
           027       2.64      580       538       2.51      2.40
           028       2.19      562       507       2.10      2.21
           029       3.34      683       648       3.21      3.58
           030       3.48      717       724       3.68      3.51
           031       3.56      701       714       3.48      3.62
           032       3.81      691       684       3.71      3.60
           033       3.92      714       706       3.81      3.65
           034       4.00      689       673       3.84      3.76
           035       2.52      554       507       2.09      2.27
           036       2.71      564       543       2.17      2.35 
           037       3.15      668       604       2.98      3.17
           038       3.22      691       662       3.28      3.47
           039       2.29      573       591       2.74      3.00
           040       2.03      568       517       2.19      2.74
           041       3.14      607       624       3.28      3.37
           042       3.52      651       683       3.68      3.54
           043       2.91      604       583       3.17      3.28
           044       2.83      560       542       3.17      3.39
           045       2.65      604       617       3.31      3.28
           046       2.41      574       548       3.07      3.19
           047       2.54      564       500       2.38      2.52
           048       2.66      607       528       2.94      3.08
           049       3.21      619       573       2.84      3.01
           050       3.34      647       608       3.17      3.42
           051       3.68      651       683       3.72      3.60
           052       2.84      571       543       2.17      2.40
           053       2.74      583       510       2.42      2.83
           054       2.71      554       538       2.49      2.38
           055       2.24      568       519       3.38      3.21
           056       2.48      574       602       2.07      2.24
           057       3.14      605       619       3.22      3.40
           058       2.83      591       584       2.71      3.07
           059       3.44      642       608       3.31      3.52
           060       2.89      608       573       3.28      3.47
           061       2.67      574       538       3.19      3.08
           062       3.24      643       607       3.24      3.38
           063       3.29      608       649       3.53      3.41
           064       3.87      709       688       3.72      3.64
           065       3.94      691       645       3.98      3.71
           066       3.42      667       583       3.09      3.01
           067       3.52      656       609       3.42      3.37
           068       2.24      554       542       2.07      2.34
           069       3.29      692       563       3.17      3.29
           070       3.41      684       672       3.51      3.40
           071       3.56      717       649       3.49      3.38
           072       3.61      712       708       3.51      3.28
           073       3.28      641       608       3.40      3.31
           074       3.21      675       632       3.38      3.42
           075       3.48      692       698       3.54      3.39
           076       3.62      684       609       3.48      3.51
           077       2.92      564       591       3.09      3.17
           078       2.81      554       509       3.14      3.20
           079       3.11      685       694       3.28      3.41
           080       3.28      671       609       3.41      3.29
           081       2.70      571       503       3.02      3.17
           082       2.62      582       591       2.97      3.12
           083       3.72      621       589       4.00      3.71
           084       3.42      651       642       3.34      3.50
           085       3.51      673       681       3.28      3.34
           086       3.28      651       640       3.32      3.48
           087       3.42      672       607       3.51      3.44
           088       3.90      591       587       3.68      3.59
           089       3.12      582       612       3.07      3.28
           090       2.83      609       555       2.78      3.00
           091       2.09      554       480       3.68      3.42
           092       3.17      612       590       3.30      3.41
           093       3.28      628       580       3.34      3.49
           094       3.02      567       602       3.17      3.28
           095       3.42      619       623       3.07      3.17
           096       3.06      691       683       3.19      3.24
           097       2.76      564       549       2.15      2.34
           098       3.19      650       684       3.11      3.28
           099       2.23      551       554       2.17      2.29
           100       2.48      568       541       2.14      2.08
           101       3.76      605       590       3.74      3.64
           102       3.49      692       683       3.27      3.42
           103       3.07      680       692       3.19      3.25
           104       2.19      617       503       2.98      2.76
           105       3.46      516       528       3.28      3.41
           --------------------------------------------------------    
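
           Each row above follows the fixed-column layout that
           the DATA LIST command in reg_sion.r01 describes
           (Stu_Code in columns 12-14, High_GPA in 22-25, and
           so on).  The sketch below shows one way such a row
           could be split by column position.  It is written
           in Python only for illustration; the names COLUMNS
           and parse_row are hypothetical and are not part of
           the SPSS or MINITAB files.

```python
# Column ranges (1-indexed, inclusive), copied from the DATA LIST
# command in reg_sion.r01.
COLUMNS = {
    "Stu_Code": (12, 14),
    "High_GPA": (22, 25),
    "Math_SAT": (32, 34),
    "Verb_SAT": (42, 44),
    "Comp_GPA": (52, 55),
    "Univ_GPA": (62, 65),
}


def parse_row(line):
    """Split one fixed-width data line into a dict of numeric fields."""
    record = {}
    for name, (start, end) in COLUMNS.items():
        text = line[start - 1:end]  # convert 1-indexed columns to a slice
        record[name] = float(text) if "." in text else int(text)
    return record


# Student 001's row, rebuilt with explicit padding so that the
# column positions are easy to see.
row = (" " * 11 + "001" + " " * 7 + "3.45" + " " * 6 + "643"
       + " " * 7 + "589" + " " * 7 + "3.76" + " " * 6 + "3.52")

record = parse_row(row)
print(record)
```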


Files:       1.  reg_sion.doc

             2.  reg_sion.dat

             3.  reg_sion.r01

             4.  reg_sion.o01

             5.  reg_sion.con

             6.  reg_sion.lis


Command:     At the Unix prompt (%), key:

             % spss -m < reg_sion.r01 > reg_sion.o01


************
reg_sion.dat
************
           001       3.45      643       589       3.76      3.52
           002       2.78      558       512       2.87      2.91
           003       2.52      583       503       2.54      2.40
           004       3.67      685       602       3.83      3.47
           005       3.24      592       538       3.29      3.47
           006       2.10      562       486       2.64      2.37
           007       2.82      573       548       2.86      2.40
           008       2.36      559       536       2.03      2.24
           009       2.42      552       583       2.81      3.02
           010       3.51      617       591       3.41      3.32
           011       3.48      684       649       3.61      3.59
           012       2.14      568       592       2.48      2.54
           013       2.59      604       582       3.21      3.19
           014       3.46      619       624       3.52      3.71
           015       3.51      642       619       3.41      3.58
           016       3.68      683       642       3.52      3.40
           017       3.91      703       684       3.84      3.73
           018       3.72      712       652       3.64      3.49
           019       2.15      564       501       2.14      2.25
           020       2.48      557       549       2.21      2.37
           021       3.09      591       584       3.17      3.29
           022       2.71      599       562       3.01      3.19
           023       2.46      607       619       3.17      3.28
           024       3.32      619       558       3.01      3.37
           025       3.61      700       721       3.72      3.61
           026       3.82      718       732       3.78      3.81
           027       2.64      580       538       2.51      2.40
           028       2.19      562       507       2.10      2.21
           029       3.34      683       648       3.21      3.58
           030       3.48      717       724       3.68      3.51
           031       3.56      701       714       3.48      3.62
           032       3.81      691       684       3.71      3.60
           033       3.92      714       706       3.81      3.65
           034       4.00      689       673       3.84      3.76
           035       2.52      554       507       2.09      2.27
           036       2.71      564       543       2.17      2.35 
           037       3.15      668       604       2.98      3.17
           038       3.22      691       662       3.28      3.47
           039       2.29      573       591       2.74      3.00
           040       2.03      568       517       2.19      2.74
           041       3.14      607       624       3.28      3.37
           042       3.52      651       683       3.68      3.54
           043       2.91      604       583       3.17      3.28
           044       2.83      560       542       3.17      3.39
           045       2.65      604       617       3.31      3.28
           046       2.41      574       548       3.07      3.19
           047       2.54      564       500       2.38      2.52
           048       2.66      607       528       2.94      3.08
           049       3.21      619       573       2.84      3.01
           050       3.34      647       608       3.17      3.42
           051       3.68      651       683       3.72      3.60
           052       2.84      571       543       2.17      2.40
           053       2.74      583       510       2.42      2.83
           054       2.71      554       538       2.49      2.38
           055       2.24      568       519       3.38      3.21
           056       2.48      574       602       2.07      2.24
           057       3.14      605       619       3.22      3.40
           058       2.83      591       584       2.71      3.07
           059       3.44      642       608       3.31      3.52
           060       2.89      608       573       3.28      3.47
           061       2.67      574       538       3.19      3.08
           062       3.24      643       607       3.24      3.38
           063       3.29      608       649       3.53      3.41
           064       3.87      709       688       3.72      3.64
           065       3.94      691       645       3.98      3.71
           066       3.42      667       583       3.09      3.01
           067       3.52      656       609       3.42      3.37
           068       2.24      554       542       2.07      2.34
           069       3.29      692       563       3.17      3.29
           070       3.41      684       672       3.51      3.40
           071       3.56      717       649       3.49      3.38
           072       3.61      712       708       3.51      3.28
           073       3.28      641       608       3.40      3.31
           074       3.21      675       632       3.38      3.42
           075       3.48      692       698       3.54      3.39
           076       3.62      684       609       3.48      3.51
           077       2.92      564       591       3.09      3.17
           078       2.81      554       509       3.14      3.20
           079       3.11      685       694       3.28      3.41
           080       3.28      671       609       3.41      3.29
           081       2.70      571       503       3.02      3.17
           082       2.62      582       591       2.97      3.12
           083       3.72      621       589       4.00      3.71
           084       3.42      651       642       3.34      3.50
           085       3.51      673       681       3.28      3.34
           086       3.28      651       640       3.32      3.48
           087       3.42      672       607       3.51      3.44
           088       3.90      591       587       3.68      3.59
           089       3.12      582       612       3.07      3.28
           090       2.83      609       555       2.78      3.00
           091       2.09      554       480       3.68      3.42
           092       3.17      612       590       3.30      3.41
           093       3.28      628       580       3.34      3.49
           094       3.02      567       602       3.17      3.28
           095       3.42      619       623       3.07      3.17
           096       3.06      691       683       3.19      3.24
           097       2.76      564       549       2.15      2.34
           098       3.19      650       684       3.11      3.28
           099       2.23      551       554       2.17      2.29
           100       2.48      568       541       2.14      2.08
           101       3.76      605       590       3.74      3.64
           102       3.49      692       683       3.27      3.42
           103       3.07      680       692       3.19      3.25
           104       2.19      617       503       2.98      2.76
           105       3.46      516       528       3.28      3.41


************
reg_sion.r01
************
SET WIDTH      = 80
SET LENGTH     = NONE
SET CASE       = UPLOW
SET HEADER     = NO
TITLE          = Multiple Regression
COMMENT        = This file is used to build a regression model
                 for University Grade Point Average and SAT
                 scores.  That is to say:

                 -- Provided that you know a student's Math
                    SAT score and Verbal SAT score,

                 -- Can you use these two scores to predict
                    this student's University GPA?

                 Data are from the 105 students who graduated 
                 from a local state university, earning a B.S. 
                 in Computer Science.  As entering freshmen,
                 these students need a Math SAT score of 550    
                 or greater to major in Computer Science.

                 Because the data are all interval data (i.e., 
                 the data are parametric, with the difference 
                 between a 3.87 GPA and a 3.88 GPA equal to the 
                 difference between a 4.03 GPA and a 4.04 GPA,
                 Pearson's Coefficient of Correlation is the 
                 correct test to determine the degree of    
                 association between these variables.

                 Note:  The data file 'reg_sion.dat' is an
                 exact copy of the file used to conduct the
                 Pearson's Coefficient of Correlation analysis.
DATA LIST FILE = 'reg_sion.dat' FIXED
     / Stu_Code  12-14
       High_GPA  22-25
       Math_SAT  32-34
       Verb_SAT  42-44
       Comp_GPA  52-55
       Univ_GPA  62-65

FORMAT High_GPA (F4.2)
FORMAT Comp_GPA (F4.2)
FORMAT Univ_GPA (F4.2)

COMMENT          = By using the "FORMAT" command in this way, 
                   the three GPA scores are restricted to 
                   four columns, with the last two columns to
                   the right of the decimal point. 

Variable Labels
       Stu_Code   "Student Code"
     / High_GPA   "High School GPA"
     / Math_SAT   "Mathematics SAT Score"
     / Verb_SAT   "Verbal SAT Score"
     / Comp_GPA   "GPA in Computer Science Courses"
     / Univ_GPA   "GPA in All University Courses"

REGRESSION VARIABLES  = Univ_GPA   Math_SAT   Verb_SAT
     / DEPENDENT      = Univ_GPA 
     / METHOD         = ENTER

COMMENT               = Notice how Univ_GPA is declared as the
                        dependent variable.


************
reg_sion.o01
************
   1  SET WIDTH      = 80
   2  SET LENGTH     = NONE
   3  SET CASE       = UPLOW
   4  SET HEADER     = NO
   5  TITLE          = Multiple Regression
   6  COMMENT        = This file is used to build a regression model
   7                   for University Grade Point Average and SAT
   8                   scores.  That is to say:
   9
  10                   -- Provided that you know a student's Math
  11                      SAT score and Verbal SAT score,
  12
  13                   -- Can you use these two scores to predict
  14                      this student's University GPA?
  15
  16                   Data are from the 105 students who graduated
  17                   from a local state university, earning a B.S.
  18                   in Computer Science.  As entering freshmen,
  19                   these students need a Math SAT score of 550
  20                   or greater to major in Computer Science.
  21
  22                   Because the data are all interval data (i.e.,
  23                   the data are parametric, with the difference
  24                   between a 3.87 GPA and a 3.88 GPA equal to the
  25                   difference between a 4.03 GPA and a 4.04 GPA,
  26                   Pearson's Coefficient of Correlation is the
  27                   correct test to determine the degree of
  28                   association between these variables.
  29
  30                   Note:  The data file 'reg_sion.dat' is an
  31                   exact copy of the file used to conduct the
  32                   Pearson's Coefficient of Correlation analysis.
  33  DATA LIST FILE = 'reg_sion.dat' FIXED
  34       / Stu_Code  12-14
  35         High_GPA  22-25
  36         Math_SAT  32-34
  37         Verb_SAT  42-44
  38         Comp_GPA  52-55
  39         Univ_GPA  62-65
  40

This command will read 1 records from reg_sion.dat

Variable   Rec   Start     End         Format

STU_CODE     1      12      14         F3.0
HIGH_GPA     1      22      25         F4.0
MATH_SAT     1      32      34         F3.0
VERB_SAT     1      42      44         F3.0
COMP_GPA     1      52      55         F4.0
UNIV_GPA     1      62      65         F4.0

  41  FORMAT High_GPA (F4.2)
  42  FORMAT Comp_GPA (F4.2)
  43  FORMAT Univ_GPA (F4.2)
  44
  45  COMMENT          = By using the "FORMAT" command in this way,
  46                     the three GPA scores are restricted to
  47                     four columns, with the last two columns to
  48                     the right of the decimal point.
  49
  50  Variable Labels
  51         Stu_Code   "Student Code"
  52       / High_GPA   "High School GPA"
  53       / Math_SAT   "Mathematics SAT Score"
  54       / Verb_SAT   "Verbal SAT Score"
  55       / Comp_GPA   "GPA in Computer Science Courses"
  56       / Univ_GPA   "GPA in All University Courses"
  57
  58  REGRESSION VARIABLES  = Univ_GPA   Math_SAT   Verb_SAT
  59       / DEPENDENT      = Univ_GPA
  60       / METHOD         = ENTER
  61
  62  COMMENT               = Notice how Univ_GPA is declared as the
  63                          dependent variable.

     1404 bytes of memory required for REGRESSION procedure.
        0 more bytes may be needed for Residuals plots.




           * * * *   M U L T I P L E   R E G R E S S I O N   * * * *


Listwise Deletion of Missing Data

Equation Number 1    Dependent Variable..   UNIV_GPA   GPA in All University Cou

Block Number  1.  Method:  Enter


Variable(s) Entered on Step Number
   1..    VERB_SAT  Verbal SAT Score
   2..    MATH_SAT  Mathematics SAT Score


Multiple R           .68573
R Square             .47022
Adjusted R Square    .45983
Standard Error       .32867

Analysis of Variance
                    DF      Sum of Squares      Mean Square
Regression           2             9.77974          4.88987
Residual           102            11.01840           .10802

F =      45.26669       Signif F =  .0000


------------------ Variables in the Equation ------------------

Variable              B        SE B       Beta         T  Sig T

MATH_SAT        .003291     .001090    .395622     3.019  .0032
VERB_SAT        .002272  9.3082E-04    .319867     2.441  .0164
(Constant)     -.237534     .375038                -.633  .5279


End Block Number   1   All requested variables entered.


************
reg_sion.con
************
Outcome:     The following information from the SPSS output 
             file is used to develop the model, or the
             prediction equation: 

------------------ Variables in the Equation ------------------

Variable              B        SE B       Beta         T  Sig T

MATH_SAT        .003291     .001090    .395622     3.019  .0032
VERB_SAT        .002272  9.3082E-04    .319867     2.441  .0164
(Constant)     -.237534     .375038                -.633  .5279

             Although there is an abundance of information in
             this part of the SPSS printout, the values in the
             B column are what you need to build the prediction
             equation:

             Univ_GPA = Constant +or- (x * Math_SAT) 
                                 +or- (y * Verb_SAT)

             Univ_GPA = -.237534 + (.003291 * Math_SAT)
                                 + (.002272 * Verb_SAT)

 
             It is always best to try a sample calculation to 
             see if the model is reasonable.  Imagine a student 
             with a Math_SAT of 650 and a Verb_SAT of 
             625.  Using the prediction formula for this study:

             Univ_GPA = -.237534 + (.003291 * Math_SAT)
                                 + (.002272 * Verb_SAT)

             Univ_GPA = -.237534 + (.003291 * 650) + (.002272 * 625)

             Univ_GPA = -.237534 + (2.13915) + (1.42)

             Univ_GPA = 3.32162

             And it is certainly reasonable to think that a 
             student with a Math_SAT score of 650 and a Verb_SAT
             score of 625 would graduate from university with a
             GPA of about 3.32 (a GPA of 4.0 is all A's).
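
             The sample calculation can also be sketched in a
             few lines of code.  The constant and coefficients
             below are taken from the B column of the SPSS
             printout; the function name predict_univ_gpa is
             hypothetical, and Python is used only for
             illustration.

```python
# Constant and coefficients from the B column of the SPSS printout.
CONSTANT = -0.237534
B_MATH_SAT = 0.003291
B_VERB_SAT = 0.002272


def predict_univ_gpa(math_sat, verb_sat):
    """Apply the prediction equation to one student's SAT scores."""
    return CONSTANT + (B_MATH_SAT * math_sat) + (B_VERB_SAT * verb_sat)


# Same student as in the sample calculation: Math_SAT 650, Verb_SAT 625.
predicted = predict_univ_gpa(650, 625)
print(round(predicted, 5))
```

             Trying a few other SAT score pairs in the same
             way is a quick check that the predictions stay
             within a plausible GPA range.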

             Again, always try a sample calculation to check that
             the model gives reasonable results.
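
             In the same spirit, the summary figures in the
             SPSS printout can be cross-checked against one
             another.  The sketch below uses the standard
             textbook relationships (R Square as the regression
             share of the total sum of squares, F as the ratio
             of the mean squares, T as B divided by SE B) and
             only values copied from the printout.  It is an
             illustration in Python, not a description of how
             SPSS computes these figures internally.

```python
import math

n = 105  # students in the data file
p = 2    # predictors: Math_SAT and Verb_SAT

# Sums of squares from the Analysis of Variance table.
ss_regression = 9.77974
ss_residual = 11.01840
df_regression = p
df_residual = n - p - 1  # 102, as shown in the printout

ss_total = ss_regression + ss_residual
r_square = ss_regression / ss_total
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual
std_error = math.sqrt(ms_residual)

# T is each coefficient (B) divided by its standard error (SE B).
t_math = 0.003291 / 0.001090
t_verb = 0.002272 / 9.3082e-04
t_const = -0.237534 / 0.375038

print(f"R Square          {r_square:.5f}")
print(f"Adjusted R Square {adj_r_square:.5f}")
print(f"Standard Error    {std_error:.5f}")
print(f"F                 {f_ratio:.2f}")
print(f"T (Math, Verb, Constant) {t_math:.3f} {t_verb:.3f} {t_const:.3f}")
```

             The computed values should agree with the
             printout to within rounding.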

             As you will notice, the prediction equation is much
             easier to read in the attached MINITAB printout.


************
reg_sion.lis
************
% minitab

 MTB > outfile 'reg_sion.lis'
 Collecting Minitab session in file: reg_sion.lis
 MTB > # MINITAB addendum to 'reg_sion.dat'
 MTB > #
 MTB > read 'reg_sion.dat' c1 c2 c3 c4 c5 c6
 Entering data from file: reg_sion.dat
     105 rows read.
 MTB > name c1 'Stu_Code'
 MTB > name c2 'High_GPA'
 MTB > name c3 'Math_SAT'
 MTB > name c4 'Verb_SAT'
 MTB > name c5 'Comp_GPA'
 MTB > name c6 'Univ_GPA'
 MTB > #
 MTB > # Before I conduct the regression analysis, I like to 
 MTB > # plot the two predictor variables.
 MTB > #
 MTB > plot 'Math_SAT' 'Verb_SAT'
 
          -
          -                                     2       *  **  **
       700+                                            *     **
          -                    *      * *     **2 * ** 3***
  Math_SAT-                        *   2*   *         *
          -                             *     2        2
          -                         *  *3 *            *
       630+                       *
          -        *         **  *  3      2
          -             *     *  ****    *2*    *
          -        **     *        3*   *
          -        * **   3*2       2 2
       560+   **  222    *3*2       *
          -                  *     *
          -
          -             *
          -
          ------+---------+---------+---------+---------+---------+Verb_SAT
                500       550       600       650       700       750
 
 MTB > # And as you see, there is a generally positive association
 MTB > # between Verb_SAT and Math_SAT.
 MTB > #
 MTB > regress 'Univ_GPA' on 2 predictor variables 'Math_SAT' 'Verb_SAT'
 
 The regression equation is
 Univ_GPA = - 0.238 + 0.00329 Math_SAT + 0.00227 Verb_SAT
 
 Predictor       Coef       Stdev    t-ratio        p
 Constant     -0.2375      0.3750      -0.63    0.528
 Math_SAT    0.003291    0.001090       3.02    0.003
 Verb_SAT   0.0022718   0.0009308       2.44    0.016
 
 s = 0.3287      R-sq = 47.0%     R-sq(adj) = 46.0%
 
 Analysis of Variance
 
 SOURCE       DF          SS          MS         F        p
 Regression    2      9.7797      4.8899     45.27    0.000
 Error       102     11.0184      0.1080
 Total       104     20.7981
 
 SOURCE       DF      SEQ SS
 Math_SAT      1      9.1363
 Verb_SAT      1      0.6435
 
 Continue? y

 MTB > # Unlike SPSS, MINITAB actually prints out the regression
 MTB > # formula:
 MTB > #
 MTB > # Univ_GPA = - 0.238 + 0.00329 Math_SAT + 0.00227 Verb_SAT
 MTB > #
 MTB > # I will test this formula, using 650 on Math SAT and 589
 MTB > # for the Verbal SAT score:
 MTB > #
 MTB > # Univ_GPA = -0.238 + (0.00329 * 650) + (0.00227 * 589)
 MTB > # Univ_GPA = -0.238 + (2.1385) + (1.33703)
 MTB > # Univ_GPA =  3.23753
 MTB > #
 MTB > # And it is perfectly reasonable to expect a student with
 MTB > # a Math SAT score of 650 and a Verbal SAT score of 589 
 MTB > # to later achieve a University Grade Point Average of
 MTB > # approximately 3.24.
 MTB > #
 MTB > # As a note, you may want to look into the issue of
 MTB > # multicollinearity when determining which predictor
 MTB > # variables to select for the regression model.  But
 MTB > # this topic is beyond the scope of this tutorial.
 MTB > stop
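
The equation that MINITAB prints uses rounded coefficients, while
the coefficient table carries more digits.  The sketch below
compares the two for the student tested in the session above
(Math SAT 650, Verbal SAT 589).  It is a Python illustration
only, and the helper name predict is hypothetical.

```python
def predict(constant, b_math, b_verb, math_sat, verb_sat):
    """Evaluate the regression equation for one student."""
    return constant + (b_math * math_sat) + (b_verb * verb_sat)


# Rounded coefficients, as printed in "The regression equation is".
rounded = predict(-0.238, 0.00329, 0.00227, 650, 589)

# Fuller-precision coefficients, as printed in the SPSS B column.
full = predict(-0.237534, 0.003291, 0.002272, 650, 589)

print(round(rounded, 5), round(full, 5))
print("difference:", round(abs(full - rounded), 5))
```

Both versions give a predicted University GPA of about 3.24, so
the rounding matters little at the two decimal places normally
used to report a GPA.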
             

--------------------------
Disclaimer:  All care was used to prepare the information in this 
tutorial.  Even so, the author does not and cannot guarantee the 
accuracy of this information.  The author disclaims any and all 
injury that may come about from the use of this tutorial.  As 
always, students and all others should check with their advisor(s) 
and/or other appropriate professionals for any and all assistance 
on research design, analysis, selected levels of significance, and 
interpretation of output file(s).

The author is entitled to exclusive distribution of this tutorial. 
Readers have permission to print this tutorial for individual use, 
provided that the copyright statement appears and that there is no 
redistribution of this tutorial without permission.

Prepared 980316
Revised  980914
end-of-file 'reg_sion.ssi'

Please send comments or suggestions to Dr. Thomas W. MacFarland
