Statistics Tutorials Based on the Use of SPSS-X™ and Minitab®
© 1998 by Dr. Thomas W. MacFarland -- All Rights Reserved

Purpose
-------

These statistics tutorials have been developed to serve the needs of the following two groups of graduate students in the social sciences:

1. These tutorials should be useful for graduate students in the social sciences who have not yet enrolled in a statistics class. In this regard, these tutorials are viewed as a useful advance organizer for an academic area that many students approach with a degree of apprehension.

2. These tutorials should also be useful for graduate students in the social sciences who have completed their required statistics class(es), but who have not yet attempted data analysis for the thesis/dissertation. Indeed, the complexity of data analysis and the general area of statistics are far too often the reasons why so many graduate students remain at ABD (All-but-Dissertation) status.

As designed, these tutorials and responses to individual questions should offer the level of help many graduate students need so that they can complete their research requirement(s) in a timely manner.

Assumptions
-----------

These tutorials are based on a set of assumptions that are common to today's graduate students in the social sciences:

1. Participants have a degree of experience with computing technology and file management, but by no means is expertise in these areas needed for success with these tutorials.

2. Participants have regular access to the Internet so that they can receive and send any electronic mail messages that may be related to this training activity.

3. Participants have access to SPSS (Statistical Package for the Social Sciences) and/or MINITAB. (Because it is so common in the social sciences, SPSS will be stressed.) Ideally, participants will have access to the online versions of these two leading software packages, so that the full suite of options and commands is available. Even so, PC-based versions of these two software products should be sufficient for most participants.

Some participants may have SAS, a widely-used statistical analysis program, available at their campus. SAS is excellent for large-scale programming jobs and complicated data warehousing and data mining activities. However, this level of complexity is beyond the scope of these introductory tutorials, which use only SPSS and MINITAB for demonstration purposes.

Limitations
-----------

1. Due to the complexities of multiple hardware and software platforms inherent to any distributed distance education activity, we will not have uniform access to the same computing machinery and software. When using these tutorials:

   -- Some participants may do most of their analyses online, using a dedicated terminal on a local area network to access a campus-based mainframe computer.

   -- Some participants may do most of their analyses online, working from home through a modem-based connection to a campus-based super-mini computer.

   -- Other participants may prefer to work offline, using PC-based software.

   Be sure to know how to contact your local system operator or other "help" personnel for purely system-specific questions.

2. These tutorials are based on standard scenarios that may be reasonably found in the social sciences. For the purpose of consistency, all examples use problems that a classroom-based computing technology teacher would need to address.

File Organization
-----------------

Each tutorial consists of six separate files that have been joined into one common file:

1. topic.doc (doc = documentation/background information file)
2. topic.dat (dat = data file, with all data in FIXED FORMAT)
3. topic.r01 (r01 = SPSS "run" file 01)
4. topic.o01 (o01 = SPSS "output" file 01)
5. topic.con (con = conclusion and explanation file)
6. topic.lis (lis = list file for the MINITAB addendum)

File Structure
--------------

A few comments about each file type may be useful as you continue with this set of tutorials:

1. topic.doc (doc = documentation/background information file)

Each documentation file describes a real-world problem faced by a computing technology teacher. Typically, I'll be sure to include background information about each statistical test, often describing examples of its correct selection and later interpretation.

The data associated with the exercise are also found in this section. In most cases, the data table is then later used as the actual data file. Of course, there may be a few modifications to data in the table as the data file is prepared, depending on the complexity of how data are organized into sub-groups.

2. topic.dat (dat = data file)

Data Organization
-----------------

Each data file is organized so that it is in FIXED column format. You may occasionally work with data that are in sequential order, such that value_A comes before value_B, but their actual line number and column placement are not important. There may be a few occasions where this lack of column placement specificity is useful, but in most cases it is essential that you work with data that are organized in a precise format.

Consider the following data file (cent_tnd.dat) that identifies final examination scores for a class of 23 students:

                   01                 089
                   02                 092
                   03                 073
                   04                 083
                   05                 056
                   06                 082
                   07                 077
                   08                 092
                   09                 100
                   10                 067
                   11                 071
                   12                 076
                   13                 083
                   14                 086
                   15                 077
                   16                 049
                   17                 071
                   18                 084
                   19                 091
                   20                 088
                   21                 082
                   22                 077
                   23                 097

-- If you have 99 or fewer students, then be sure to allocate TWO columns for each student identification code: 01 to 99, found here in columns 20-21.

-- If each test score has 100 as a maximum score, then be sure to allocate THREE columns for each student test score: 000 to 100, found here in columns 39-41.
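
As a minimal sketch of what FIXED column format means in practice, the following Python fragment pulls the student code and the examination score out of each record purely by column position. Python is not used anywhere else in these tutorials; it is chosen here only for illustration, and the names read_fixed and SAMPLE are hypothetical.

```python
# Illustrative sketch only: read_fixed and SAMPLE are hypothetical names,
# not part of SPSS, MINITAB, or the tutorial files.

# Three records laid out like cent_tnd.dat: student code in columns
# 20-21, examination score in columns 39-41, blanks everywhere else.
SAMPLE = [
    " " * 19 + "01" + " " * 17 + "089",
    " " * 19 + "02" + " " * 17 + "092",
    " " * 19 + "09" + " " * 17 + "100",
]


def read_fixed(lines):
    """Return (student_code, score) pairs taken from fixed columns."""
    records = []
    for line in lines:
        # Columns are counted from 1 but Python slices count from 0, so
        # columns 20-21 are line[19:21] and columns 39-41 are line[38:41].
        stu_code = int(line[19:21])
        score = int(line[38:41])
        records.append((stu_code, score))
    return records


if __name__ == "__main__":
    for stu_code, score in read_fixed(SAMPLE):
        print(f"{stu_code:02d} {score:03d}")
```

Because position alone identifies each value, a score keyed one column too far to the right would be misread, which is why precise column placement matters.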

The following ten lines represent a data file prepared in response to a five-question survey (responses to each question range from 1 = Very Low Opinion to 5 = Very High Opinion), by ten survey participants:

01 3 4 2 5 5
02 4 5 2 5 5
03 4 4 4 5 5
04 5 5 5 3 5
05 4 4 5 3 5
06 4 5 3 4 1
07 5 5 5 5 2
08 2 4 3 2 2
09 5 4 3 4 5
10 4 4 4 5 3

As you examine this FIXED column format data file, be sure to notice how:

-- Participant code number, ranging from 01 to 10, is placed (i.e., FIXED) in column 1 and column 2. Equally, notice how I used leading zeros for participants 01 to 09, keying in "01" instead of "1" for the first participant. I recommend that you also develop the practice of keying in leading zeros, to be sure that data are consistently placed in the correct column.

-- Otherwise, notice how I keyed in the 1-5 response to question 1 in column 4 and then allowed a blank space before keying in the 1-5 response to question 2 in column 6, until finally the 1-5 response to question 5 is placed in column 12. White space (i.e., blank space) is ignored by most statistical analysis programs, provided data are declared in the proper column in the run file. The blank spaces in this example are purely for eye appeal and convenience in case you need to look closely at the file at a later time.

Data File Construction
----------------------

Of course, there are many ways that you can use computing machinery and software to create a data file:

-- When working offline, I usually use the old-fashioned DOS "edit" editor, since it conveniently displays the line number and column number in the lower right corner of the user interface. Another advantage of the DOS edit program is that it presents data only in a fixed-width (Courier-style) font and saves plain text. This font is not only easy to see on the screen, but it fully supports the concept of FIXED column format.

-- Other times, I may use WordPerfect, Word, or any of the other leading word processing packages as data-entry tools.
   When using a fully-embellished word processing package for data entry, just be sure to immediately change the font to Courier, a fixed-width font. Equally, be sure to save the file in plain-text ASCII format.

-- Of course, when working online at the UNIX shell (if you use this system configuration), the vi, pico, and emacs editors are useful data entry tools.

-- In some cases, I'll also use a spreadsheet such as Lotus, Excel, or Quattro to enter data. However, I tend to avoid this type of data entry tool since leading zeros are usually deleted when spreadsheets are used. Again, this issue is critical when working with FIXED column format data files.

Data File Hints
---------------

When constructing a data file, your emphasis should be on accuracy. Here are a few hints that you may find useful:

-- Consider the coding scheme that you plan to use for distinct groups, such as "Female" and "Male." It is nearly always easier to work with numbers instead of letters in a run (i.e., operation) file. For this scheme, it is often best to declare female as code = 1 and male as code = 2.

   By using this type of code, you avoid the problem of consistently remembering to use caps/no_caps for F/f and M/m. Some statistical analysis programs are case sensitive and inconsistency in capitalization could be a problem.

   Another advantage of this type of coding scheme is that it promotes accurate and rapid data entry. Look at the keyboard and notice how the F key and the M key are not immediately adjacent to each other. Then, notice how the 1 key and the 2 key are adjacent. The first time you have to key in responses to 1,234 56-question surveys, these time-saving coding schemes may become useful.

-- With the possible exception of your advisor and anyone who may help you with data entry, no one is going to see your data file. The data file is merely a resource that supports your research.
   Keep the file in FIXED column format and use a font with fixed width, such as Courier, saving more attractive fonts for your actual thesis/dissertation. To be brief, use an editor that best meets your needs for data entry, rather than an editor chosen for style and presentation.

   I suggest that you practice with the DOS "edit" editor as a possible tool for data entry. As an example, you may want to practice with edit to create topic.dat as a sample data file. If so, key the following at the DOS C prompt:

   C:\> edit topic.dat

   You will then go into an interface that has on the top toolbar a set of options that you have surely seen when using other software programs:

   File  Edit  Search  View  Options  Help

   The user interface of edit has the look and feel of other leading word processing programs and with a few practice trials you may find it very useful for data entry.

-- To further promote accuracy, I suggest that you develop a template for each line of data, with an x placed in each column that will have a datum entered. Then, place your editor in overtype mode (usually toggled with the Insert key) and merely "overwrite" the x with the correct datum.

   For example, using the prior data file for responses to the five-question survey, I first prepared the following one-line template of x characters:

xx x x x x x

   Then, I used edit's Edit/Copy option and later used the Edit/Paste option nine times to create a ten-line data file that now consists of a template with x characters in the column positions where data will be entered.

xx x x x x x
xx x x x x x
xx x x x x x
xx x x x x x
xx x x x x x
xx x x x x x
xx x x x x x
xx x x x x x
xx x x x x x
xx x x x x x

   Once the data file template was completed, I then put the editor into overtype mode and typed over each x with the correct datum. Using this method, it is easy to visually scan your data file so that you maintain correct column placement when entering data.

3. topic.r01 (r01 = SPSS run file 01)

Construction of a SPSS Run File
-------------------------------

The way you prepare a run file is dependent upon the specific software program you use for statistical analysis. In these tutorials, I will prepare a complete run file using the online version of SPSS. I will then also include, as an addendum, an example of an interactive session of data analysis with MINITAB.

The following 17 lines from cent_tnd.r01 represent a SPSS-based run file:

SET WIDTH = 80
SET LENGTH = NONE
SET CASE = UPLOW
SET HEADER = NO
TITLE = Descriptive Statistics and Central Tendency
COMMENT = This file examines scores on a computing
          technology final examination
DATA LIST FILE = 'cent_tnd.dat' FIXED
  / Stu_Code 20-21
    Score    39-41

Variable Labels
    Stu_Code "Student Code"
  / Score    "Exam Score "

FREQUENCIES VARIABLES = Score
  / STATISTICS = All

To show that this run file is perhaps not as cryptic as it possibly first appears, let me dissect each of these 17 lines:

-- SET WIDTH = 80 is used in this case to set the width of the output file (cent_tnd.o01) to 80 columns wide. This option is used merely to support printing to 8-1/2 inch by 11 inch paper, as well as display on most computer screens.

-- SET LENGTH = NONE is used to turn off page ejects, which promotes printing of the output file as one continuous page, reducing wasted paper.

-- SET CASE = UPLOW is used to display labels in the output file in both upper case and lower case, as opposed to the default of upper case only.

-- SET HEADER = NO is used to turn off the printing of a header on each page of the output file, which again reduces wasted paper/printing.

-- TITLE = is used to provide a unique title at the beginning of the output file.

-- COMMENT = is used to provide a descriptive comment any place within a run file. As with any programming activity, use comments liberally to help refresh your memory when you come back to the run file weeks or months later.

-- DATA LIST FILE = 'cent_tnd.dat' FIXED
     / Stu_Code 20-21
       Score    39-41

   This portion of the run file can be broken down into its component parts:

   -- DATA LIST FILE = translates into "get the following file."

   -- 'cent_tnd.dat' is the file that the DATA LIST FILE is used to get. Be sure to notice how single quotation marks are placed before and after the file name.

   -- FIXED is used to declare that the data file 'cent_tnd.dat' is in fixed format. Be sure to review the need for FIXED column placement, as shown in previous examples.

   -- / Stu_Code 20-21 is used to declare that data for the variable Stu_Code (Student Code) are placed in column 20 and column 21. In turn, data for the variable Score (Final Examination Score) are found in columns 39 to 41. All other columns represent blank space and they are ignored since they were not declared in the run file.

      It is useful to know that the / character before Stu_Code CANNOT be placed in column 1. It is also useful to know that the variable name can be no longer than eight characters. Finally, it is best to avoid special characters in the eight-character variable names that you declare.

      I tend to start each variable name with a capital letter, but that practice is a preference and not a requirement. Otherwise, the only non-alphanumeric character that I use in variable names is _, the underscore character, to "join" compound words into one name, such as Stu_Code for "Student Code."

-- Variable Labels
       Stu_Code "Student Code"
     / Score    "Exam Score "

   This portion of the run file can be broken down into its component parts:

   -- Variable Labels is used to give more descriptive labels to the previously declared eight-character variable codes.

   -- Stu_Code "Student Code" is used to provide a more descriptive label in the output file to the variable Stu_Code. The same logic applies to the other label. Again, it is useful to know that the / character before Score CANNOT be placed in column 1.
      It is also best to keep variable labels at 40 characters or less, avoiding special characters within the label. Be sure to notice how double quotation marks are placed before and after the variable label.

-- FREQUENCIES VARIABLES = Score
     / STATISTICS = All

   This final portion of the run file can also be broken down into its component parts:

   -- The code FREQUENCIES VARIABLES = Score is the command to conduct a frequency analysis of the variable called "Score." By using this command, you will get a printout of how often each datum has been found.

   -- The code / STATISTICS = All is the command to conduct all available descriptive statistics on the variable named "Score." The leading descriptive statistics included in the output file are:

      -- N or the number of valid scores read into the program.
      -- Mode or the most frequently occurring score.
      -- Median or the mid-point of all scores placed in an array.
      -- Mean or the arithmetic average of all scores.
      -- Std dev or the standard deviation of all scores.
      -- Range or the dispersion from the lowest score to the highest score.

Batch Processing with SPSS vs. Interactive Processing
------------------------------

I prefer to work with SPSS in batch mode, where all files have been prepared in advance. This practice is the alternative to the interactive use of SPSS, where all analyses are keyed in real time. My reasons for preferring batch processing are:

-- As you progress through this series of tutorials, you will learn how to use and reuse complete SPSS run files and/or sections of SPSS run files. This concept is similar to modular programming, where you recycle modules or sections of a program in other programs. This practice reduces errors, while increasing efficiency.

-- Interactive processing does not support the reuse of run files and/or sections of run files. Instead, everything is typically keyed over and over again, which wastes time while increasing the possibility of errors.
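
Before moving on, the leading descriptive statistics requested by / STATISTICS = All (N, Mode, Median, Mean, Std dev, and Range) can be cross-checked outside SPSS. The following is a minimal sketch that applies Python's standard statistics module to the 23 final examination scores from cent_tnd.dat. Python is an assumption of this illustration and not part of the SPSS/MINITAB workflow used in these tutorials; the helper name describe is hypothetical.

```python
# Illustrative cross-check only; describe() is a hypothetical helper,
# not an SPSS or MINITAB command.
import statistics

# The 23 final examination scores from cent_tnd.dat, in student-code order.
SCORES = [89, 92, 73, 83, 56, 82, 77, 92, 100, 67, 71, 76,
          83, 86, 77, 49, 71, 84, 91, 88, 82, 77, 97]


def describe(scores):
    """Return the leading descriptive statistics as a dictionary."""
    return {
        "N": len(scores),
        "Mode": statistics.mode(scores),
        "Median": statistics.median(scores),
        "Mean": statistics.mean(scores),
        # stdev() is the sample standard deviation (n - 1 denominator).
        "Std dev": statistics.stdev(scores),
        "Range": max(scores) - min(scores),
    }


if __name__ == "__main__":
    for name, value in describe(SCORES).items():
        print(f"{name:8s} {value:10.3f}")
```

With these scores the sketch should agree with the SPSS output file presented later in this tutorial: Mean 80.130, Median 82, Mode 77, Std dev 12.211, and Range 51.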

Communicating Between a SPSS Run File and a Data File
--------------------------------

I suggest that you always keep the data file separate from the run file, although there are options with SPSS for you to place the two files into one common file. The advantages of keeping the two files separate, with each file having a unique ending to the filename (.dat and .r01), are:

-- As you obtain and then enter data, it is easier to have data and only data in the data file.

-- Run files can be reused, with only slight modifications, for new analyses. It is messy if data that have nothing to do with the new analyses are found in the run file, requiring you to delete old data.

The method you use to direct the run file to act upon the data file depends on the unique way your operating system is organized. (Go back to the DATA LIST FILE = 'cent_tnd.dat' FIXED section of code to review this process.) Again, I suggest that you check with your system operator to determine how your local environment is organized for the use of SPSS and/or MINITAB in batch mode.

In this series of tutorials, I will present the online use of SPSS at a host computer using the UNIX operating system, working in batch mode:

-- Nearly all colleges and universities that have a computing center will also likely have this common software program online and available to students. Of course, you may need to make a series of phone calls to finally get an online account at your institution. You may also need to obtain an information packet to learn if the software is available through menu selections or some other organized format.

-- You may also choose to use a PC-based version of SPSS or MINITAB. Obviously, file setup will be totally dependent on how you organize directories and files.

At the online system where these tutorials were developed, the % character is the UNIX operating system prompt.
This set of tutorials was designed to have the run file communicate with the data file by using the following typical command when working at the % UNIX prompt:

   % spss -m < cent_tnd.r01 > cent_tnd.o01

Let's translate this command into plain English:

-- spss -m tells the system to prepare to use the SPSS data analysis and statistics software program in batch mode.

-- < cent_tnd.r01 means that the program should "read in" the run file named cent_tnd.r01.

-- > cent_tnd.o01 means that the program should direct the output of the SPSS session into the output file named cent_tnd.o01.

By using this approach to batch data processing with SPSS:

-- Output files are separate from the run files and the data file.

-- If you plan to conduct multiple analyses of the cent_tnd.dat data file, then you only need to modify the run file(s) and use the numbering sequence of 01 to 99 (cent_tnd.r01, cent_tnd.r02, cent_tnd.r03, etc.) to conveniently rename them.

4. topic.o01 (o01 = SPSS output file 01)

Interpretation of a SPSS Output File
------------------------------------

A SPSS output file (cent_tnd.o01 in this set of examples) begins by repeating the organization of the run file. After this beginning section is presented, the output file then displays the analyses requested in the series of commands included in the run file.

The following 61-line file is presented as a sample SPSS-based output file:

-- The first 27 lines are a repeat of the run file, with additional information about the format of each variable. Notice how the variable Stu_Code is in FIXED format, requiring two columns, and Score is also in FIXED format, requiring three columns.

   1  SET WIDTH = 80
   2  SET LENGTH = NONE
   3  SET CASE = UPLOW
   4  SET HEADER = NO
   5  TITLE = Descriptive Statistics and Central Tendency
   6  COMMENT = This file examines scores on a computing
   7            technology final examination
   8  DATA LIST FILE = 'cent_tnd.dat' FIXED
   9    / Stu_Code 20-21
  10      Score    39-41
  11

This command will read 1 records from cent_tnd.dat

Variable  Rec  Start    End    Format

STU_CODE    1     20     21    F2.0
SCORE       1     39     41    F3.0

  12  Variable Labels
  13      Stu_Code "Student Code"
  14    / Score    "Exam Score "
  15
  16
  17  FREQUENCIES VARIABLES = Score
  18    / STATISTICS = All

-- The next set of 27 lines shows a frequency distribution of the variable Score. To interpret this part of the output file, notice that there was one occurrence of the score "49" and that this score represented 4.3 percent of all scores. There were three occurrences of the score "77" and this score represented 13.0 percent of the entire data set. In turn, there were two occurrences of the score "82" and two occurrences of the score "83."

Score     Exam Score

                                                Valid      Cum
Value Label        Value  Frequency  Percent  Percent  Percent

                      49          1      4.3      4.3      4.3
                      56          1      4.3      4.3      8.7
                      67          1      4.3      4.3     13.0
                      71          2      8.7      8.7     21.7
                      73          1      4.3      4.3     26.1
                      76          1      4.3      4.3     30.4
                      77          3     13.0     13.0     43.5
                      82          2      8.7      8.7     52.2
                      83          2      8.7      8.7     60.9
                      84          1      4.3      4.3     65.2
                      86          1      4.3      4.3     69.6
                      88          1      4.3      4.3     73.9
                      89          1      4.3      4.3     78.3
                      91          1      4.3      4.3     82.6
                      92          2      8.7      8.7     91.3
                      97          1      4.3      4.3     95.7
                     100          1      4.3      4.3    100.0
                            -------  -------  -------
                   Total         23    100.0    100.0

-- The last seven lines of the output file were generated because of the / STATISTICS = All command. In this example, you will notice that there were 23 valid cases in this data set, the Mean was 80.130, the Median was 82, the Mode was 77, the Standard Deviation (Std dev) was 12.211, and the Range was 51 (Minimum = 49 to Maximum = 100).

Mean         80.130      Std err       2.546      Median       82.000
Mode         77.000      Std dev      12.211      Variance    149.119
Kurtosis       .914      S E Kurt       .935      Skewness      -.813
S E Skew       .481      Range        51.000      Minimum      49.000
Maximum     100.000      Sum        1843.000

Valid cases      23      Missing cases      0

Error Messages and Warning Messages in a SPSS Output File
----------------------------------------

If you have illogical code in the SPSS run file or an outright mistake of some kind, the SPSS output file will have an Error message. In most cases, the Error message will give you the line/column number in the run file that caused the Error. Quite frequently, the Error message will give you the information needed to correct the run file.

Warning messages are also common, and by no means do they mean that you have a problem. Instead, they merely warn you about something in the output file that needs some degree of attention.

5. topic.con (con = conclusion file)

By choice, I always summarize the analysis in the topic.con conclusion file. When working with inferential tests such as Student's t-Test or Oneway ANOVA, I'll be sure to summarize outcomes by referring back to the Null Hypothesis (Ho). A sample conclusion from a study on student examination scores in a C programming class may be useful at this point, with more exposure to this area presented in later tutorials:

-- Ho (Null Hypothesis): There is no difference in examination scores in a C programming class between students from the various townships representing the general sending district of Cape May County (e.g., Upper Township, Middle Township, Lower Township, Cape May City) at p <= .05.

-- Outcome:

      Computed F  = 6.79
      Criterion F = 3.29 (alpha = .05, df = 3,15)

      Computed F (6.79) > Criterion F (3.29)

-- Conclusion: The computed F statistic exceeds the criterion F statistic and the null hypothesis is rejected.
   That is to say, there are differences in examination scores in the C programming class between students from the various townships representing the general sending district of Cape May County (e.g., Upper Township, Middle Township, Lower Township, Cape May City) at alpha (or p) = .05.

Again, the conclusion file is a summary of outcomes from the analysis and this summary should be helpful when you prepare the results section of your thesis/dissertation.

6. topic.lis (lis = list file for the MINITAB analysis)

By using SPSS in batch mode, I am able to develop run files that support a complex array of statistical analyses. Further, these SPSS run files can serve as modules for later analyses. Some editing may be required, but it is very common to save up to 80 percent of your SPSS programming time after you develop a personal collection of run files and then use these run files in other analyses.

However, there are times when you just want to do a quick analysis and you do not want to spend too much time developing a complex and elaborate run file. When this is the case, I find MINITAB in interactive mode a very useful software program.

As such, I end each tutorial with a MINITAB addendum, showing the use of MINITAB in an interactive mode. If you have MINITAB available on your campus computing system, you may find it interesting to try a few of these sample interactive sessions. Again, ask your system operator if MINITAB is available on your campus computing system.

At the % UNIX prompt, I go into MINITAB by keying:

   % minitab

Then, I save the output by placing it into a topic.lis list file, which is identified in each MINITAB addendum.

File Placement at an Online Host Computing System
---------------------------

As you will notice in the Limitations section of this introduction, it is only to be expected that there will be considerable variance in computing hardware and software among the individuals who use this series of tutorials.
You will also notice that my examples are based on the use of SPSS and MINITAB at an online host computing system, instead of using the PC versions of these two software programs. If you use a PC-based version of SPSS, nearly all of the examples in these tutorials can be (and should be) replicated in the dialog box.

Again, you may need to contact your local system administrator to learn how to access online files. If you prepare the ASCII-based run file and data file offline, using edit or any other word processing program, you will also need to learn how to upload files from your PC to the online host computer. Of course, you will also want to learn how to download the output files from the online host computer to your PC.

Conclusion
----------

Best wishes as you continue with these tutorials. Send e-mail to t_macfarland@hotmail.com if you have questions about these tutorials and how they can help you complete the statistical analyses associated with your graduate research project.

--------------------------

Disclaimer: All care was used to prepare the information in this tutorial. Even so, the author does not and cannot guarantee the accuracy of this information. The author disclaims any and all injury that may come about from the use of this tutorial. As always, students and all others should check with their advisor(s) and/or other appropriate professionals for any and all assistance on research design, analysis, selected levels of significance, and interpretation of output file(s).

The author is entitled to exclusive distribution of this tutorial. Readers have permission to print this tutorial for individual use, provided that the copyright statement appears and that there is no redistribution of this tutorial without permission.

Prepared 980316
Revised  980914

end-of-file 'introduc.ssi'