Program 4A: Linear Regression Analysis

Using a linked list, write a program to calculate the mean and standard deviation of a set of data

Given Requirements

 

Requirements:Write a program to calculate the linear regression size-estimating parameters for a set of n programs where historical object LOC and new and changed LOC are available. Linear regression and the required formulas are explained in section A7. Enhance the linked list of program 1A to hold the n data records, where each record holds two real numbers

Testing:Thoroughly test the program. At a minimum, use this program to calculate the beta parameters for three cases [taken from table D8 in the text and from programs 2A, 3A, 4A]... Prepare a report of your tests that includes a table of the planned and actual results from these tests in the format in table D9

 
--[Humphrey95] 

Table 4-1. Test Results Format-- Program 4A

TestExpected B0Expected B1Actual B0Actual B1
Table D8: Estimated Class vs Actual New and Changed LOC-22.551.7279  
Table D8: Estimated New and Changed LOC vs Actual New and Changed LOC-23.921.4310  
Program 2A: Estimated New and Changed LOC vs Actual New and Changed LOC    
Program 3A: Estimated New and Changed LOC vs Actual New and Changed LOC    
Program 4A: Estimated New and Changed LOC vs Actual New and Changed LOC    

Planning

My goodness! PSP1 comes complete with a very elaborate script for calculating linear regression parameters and evolving a complex size and time estimate. It's obvious that program 3a is intended to help, but it comes too late-- I wasted a great deal of time calculating the linear regression parameters by hand, and simply made an educated guess on the time estimate rather than running through the process again. This portion needs tools in a big way. I actually recommend to anyone learning the PSP that they get these tools ahead of time. I commend Humphrey on making tool construction part of the syllabus, but the tools are coming after they are needed at least once. Perhaps this is done to impress the student with the necessity of automation, but it has the unfortunate effect of artificially inflating the time spent in the planning phase (if the planning phase is incredibly long on one assignment because of hand calculations and shortened dramatically on the next due to automation, it will skew things a bit!).

I'm planning on reusing the number_list and simple_input_parser classes from program 1A; through the estimation script (which I think is artificially inflating estimates from programs 2A and 3A, which ended up significantly larger than expected), I'm estimating (through the PROBE process) 143 new LOC, and (through analogy and guessing) about 2 hours of development time.

Development

Design

The design will differ very slightly from that evidenced by the conceptual design used in planning. I will indeed reuse the number_list class from lesson 1, but will delete i/o functionality from it, creating a paired_number_list class, which holds two number_lists. I'm removing the I/O in order to refactor it into a new class, and taking two lists of single numbers as it will work better than adapting the original class to hold pairs of numbers, because the original mean() functionality will be easily encapsulated (and we use the means of each runs in the linear regression calculations).

The i/o will be brought in by reusing the simple_input_parser class from programs 2A and 3A. A third class will mix the two using multiple inheritance (paired_number_list_reader), interpreting an i/o stream as a set of comma-separated value lines in the form of x,y. The main program will require very minor changes (new instantiation, new printing instructions).

Code

Not too bad; this was a fairly straightforward adaptation of the design. I did discover that I'd been doing the beta calculations for my estimate incorrectly; now, of course, I have a tool to automate this for next time.

simple_input_parser.h, simple_input_parser.cpp

The simple_input_parser class was used unchanged from programs 2A and 3A.

number_list

number_list was used almost whole, but with the removal of the i/o features.

/*
*/

#ifndef NUMBER_LIST_H
#define NUMBER_LIST_H

#include <list>
#include <iostream>

//a class which encapsulates a list of double values, adding the features of
//mean and standard deviation
class number_list : public list< double >
{
 public:
  double sum( void ) const;
  double mean( void ) const;
  double standard_deviation( void ) const;
  int entry_count( void ) const;
  void add_entry( double new_entry );
};

#endif

/*
*/
/*
*/

#include "number_list.h"
#include <assert.h>
#include <stdlib.h>
#include <math.h>


double 
number_list::sum( void ) const
{
  double Result = 0;
  for ( list<double>::const_iterator iter = begin();
	iter != end();
	++iter )
    {
      Result += *iter;
    }
  return Result;
}

double 
number_list::mean( void ) const
{
  assert( entry_count() > 0 );
  return sum() / entry_count();
}

double 
number_list::standard_deviation( void ) const
{
  assert( entry_count() > 1 );
  double sum_of_square_differences = 0;
  for ( list<double>::const_iterator iter = begin();
	iter != end();
	++iter )
    {
      const double this_square_difference = *iter - mean();
      sum_of_square_differences += this_square_difference * this_square_difference;
    }
  return sqrt( sum_of_square_differences / ( entry_count() - 1 ));
}

int 
number_list::entry_count( void ) const
{
  return size();
}

void 
number_list::add_entry( double new_entry )
{
  push_back( new_entry );
}

/*
*/

paired_number_list

/*
*/

#ifndef PAIRED_NUMBER_LIST_H
#define PAIRED_NUMBER_LIST_H

#ifndef NUMBER_LIST_H
#include "number_list.h"
#endif

class paired_number_list : public number_list
{
 public:
  //modifiers
  void add_entry( double x, double y );
  void reset( void );
  paired_number_list( void );
  //number of entries
  int entry_count( void ) const;

  double x_sum( void ) const;
  double x_mean( void ) const;
  double x_standard_deviation( void ) const;

  double y_sum( void ) const;
  double y_mean( void ) const;
  double y_standard_deviation( void ) const;

  double beta_1_numerator( void ) const;
  double beta_1_denominator( void ) const;
  double beta_1( void ) const;
  double beta_0( void ) const;
 protected:
  number_list m_xs;
  number_list m_ys;
};


#endif
/*
*/
/*
*/

#include "paired_number_list.h"
#ifndef CONTRACT_H
#include "contract.h"
#endif

void
paired_number_list::reset( void )
{
  m_xs.clear();
  m_ys.clear();
}

paired_number_list::paired_number_list( void ) 
{
  reset();
}


int
paired_number_list::entry_count( void ) const
{
  REQUIRE( m_xs.entry_count() == m_ys.entry_count() );
  return m_xs.entry_count();
}

void
paired_number_list::add_entry( double x, double y )
{
  m_xs.add_entry( x );
  m_ys.add_entry( y );
}

double
paired_number_list::x_sum( void ) const
{
  return m_xs.sum();
}

double
paired_number_list::x_mean( void ) const
{
  return m_xs.mean();
}

double
paired_number_list::x_standard_deviation( void ) const
{
  return m_xs.standard_deviation();
}

double
paired_number_list::y_sum( void ) const
{
  return m_ys.sum();
}

double
paired_number_list::y_mean( void ) const
{
  return m_ys.mean();
}

double
paired_number_list::y_standard_deviation( void ) const
{
  return m_ys.standard_deviation();
}

double
paired_number_list::beta_1_numerator( void ) const
{
  double Result = 0;
  list<double>::const_iterator x_iter;
  list<double>::const_iterator y_iter;

  for( x_iter = m_xs.begin(), y_iter = m_ys.begin();
       (x_iter != m_xs.end()) && (y_iter != m_ys.end());
       ++x_iter, ++y_iter )
    {
      Result += (*x_iter)*(*y_iter);
    }
  Result -= static_cast<double>(entry_count())*x_mean()*y_mean();
  return Result;
}

double
paired_number_list::beta_1_denominator( void ) const
{
  double Result = 0;
  list<double>::const_iterator x_iter;
  list<double>::const_iterator y_iter;
  for( x_iter = m_xs.begin(), y_iter = m_ys.begin();
       (x_iter != m_xs.end()) && (y_iter != m_ys.end());
       ++x_iter, ++y_iter )
    {
      Result += (*x_iter)*(*x_iter);
    }
  Result -=  static_cast<double>(entry_count())*x_mean()*x_mean();
  return Result;
}

double
paired_number_list::beta_1( void ) const
{
  return beta_1_numerator() / beta_1_denominator();
}

double
paired_number_list::beta_0( void ) const
{
  return y_mean() - beta_1() * x_mean();
}

/*
*/

readable_paired_number_list

/*
*/

#ifndef READABLE_PAIRED_NUMBER_LIST_H
#define READABLE_PAIRED_NUMBER_LIST_H

#ifndef PAIRED_NUMBER_LIST_H
#include "paired_number_list.h"
#endif
#ifndef SIMPLE_INPUT_PARSER_H
#include "simple_input_parser.h"
#endif

class readable_paired_number_list : public paired_number_list, public simple_input_parser
{
 public:
  virtual void parse_last_line( void );
  virtual void reset( void );
  void clear_error_flag( void );
  void check_error( bool condition, const std::string& message );
  void report_error( const std::string& message );
  readable_paired_number_list::readable_paired_number_list( void );
 protected:
  int line_number;
  bool error_found;
};

#endif
/*
*/
/*
*/


#include "readable_paired_number_list.h"

void
readable_paired_number_list::parse_last_line( void )
{
  clear_error_flag();
  //split the string around the comma
  std::string::size_type comma_index = last_line().find( ',' );
  check_error( comma_index == last_line().npos, "No comma" );
  std::string x_string = last_line().substr( 0, comma_index );
  std::string y_string = last_line().substr( comma_index + 1, last_line().length() );
  //get values for each double and ensure they're valid
  char* conversion_end = NULL;
  double new_x = strtod( x_string.c_str(), &conversion_end );
  check_error( conversion_end == x_string.data(), "X invalid" );
  double new_y = strtod( y_string.c_str(), &conversion_end );
  check_error( conversion_end == y_string.data(), "Y invalid" );
  //add the entry
  if ( ! error_found ) 
    {
      cout << "added: " << new_x << ", " << new_y << "\n";
      add_entry( new_x, new_y );
    }
}

void
readable_paired_number_list::clear_error_flag( void ) 
{
  error_found = false;
}

void
readable_paired_number_list::reset( void ) 
{
  line_number = 0;
  clear_error_flag();
}

readable_paired_number_list::readable_paired_number_list( void ) :
paired_number_list(),
simple_input_parser()
{
  reset();
}

void
readable_paired_number_list::check_error( bool condition, const std::string& message )
{
  if ( condition ) 
    {
      error_found = true;
      report_error( message );
    }
}

void
readable_paired_number_list::report_error( const std::string& message )
{
  cerr << "input:" << line_number << ":" << message << ":" << last_line() << "\n";
}

/*
*/

main.cpp

/*
*/

#include <fstream>
#include <iostream>
#include "string.h"

#ifndef READABLE_PAIRED_NUMBER_LIST_H
#include "readable_paired_number_list.h"
#endif

istream *
input_stream_from_args( int arg_count, const char** arg_vector )
{
  istream* Result = NULL;
  if ( arg_count == 1 )
    {
      Result = &cin;
    }
  else 
    {
      const char* help_text = "\
PSP exercise 4A: Calculate the linear regression parameters from
a set of comma-separated values, from standard input.
Usage: \n \
\tpsp_1a\n\n";
      cout << help_text;
    }
  return Result;
}

int main( int arg_count, const char** arg_vector )
{
  //get the input stream, or print the help text as appropriate
  istream* input_stream = input_stream_from_args( arg_count, arg_vector );
  if ( input_stream != NULL ) 
    {
      //read the entries from the input stream
      readable_paired_number_list a_list;
      a_list.set_input_stream( input_stream );
      a_list.parse_until_eof();
      cout << a_list.entry_count() << " entries processed.\n";
      //output the mean, as appropriate
      if ( a_list.entry_count() > 0 )
	{
	  cout << "Beta 0: " << a_list.beta_0() << "\n";
	  cout << "Beta 1: " << a_list.beta_1() << "\n";
	}
      else
	{
	  cout << "Too few entries to calculate\n";
	}
    }
}

/*
*/

number_list.e

class NUMBER_LIST

inherit 
   LINKED_LIST[DOUBLE];
   
creation {ANY} 
   make

feature {ANY} 
   
   line_counter: INTEGER;
   
   sum: DOUBLE is 
      local 
         i: INTEGER;
      do  
         from 
            i := lower;
         until 
            not valid_index(i)
         loop 
            Result := Result + item(i);
            i := i + 1;
         end; 
      end -- sum
   
   mean: DOUBLE is 
      require 
         count > 0; 
      do  
         Result := sum / count.to_double;
      end -- mean
   
   standard_deviation: DOUBLE is 
      require 
         count > 1; 
      local 
         i: INTEGER;
         this_term_difference: DOUBLE;
         top_term: DOUBLE;
      do  
         from 
            i := lower;
         until 
            not valid_index(i)
         loop 
            this_term_difference := item(i) - mean;
            top_term := top_term + this_term_difference ^ 2;
            i := i + 1;
         end; 
         Result := (top_term / (count - 1)).sqrt;
      end -- standard_deviation

end -- NUMBER_LIST

paired_number_list.e

class PAIRED_NUMBER_LIST
   
creation
   make

feature { NONE }
   
   xs : NUMBER_LIST
   
   ys : NUMBER_LIST
   
feature {ANY}
   
   add_entry( new_x, new_y : DOUBLE ) is
      do
	 xs.add_last( new_x )
	 ys.add_last( new_y )
	 std_output.put_double( new_x )
	 std_output.put_string( ", " )
	 std_output.put_double( new_y )
	 std_output.put_string( "%N" )
      end
   
   count : INTEGER is
      do
	 Result := xs.count
      end
   
   x_sum : DOUBLE is
      do
	 Result := xs.sum
      end
   
   x_mean : DOUBLE is
      do
	 Result := xs.mean
      end
   
   x_standard_deviation : DOUBLE is
      do
	 Result := xs.standard_deviation
      end
   
   y_sum : DOUBLE is
      do
	 Result := ys.sum
      end
   
   y_mean : DOUBLE is
      do
	 Result := ys.mean
      end
   
   y_standard_deviation : DOUBLE is
      do
	 Result := ys.standard_deviation
      end
   
   entry_count : INTEGER is
      do
	 Result := xs.count
      end
   
   beta_1_numerator : DOUBLE is
      local
	 index : INTEGER
      do
	 Result := 0
	 from
	    index := xs.lower
	 until
	    not ( xs.valid_index( index ) and ys.valid_index( index ) )
	 loop
	    Result := Result + xs.item(index) * ys.item( index )
	    index := index + 1
	 end
	 Result := Result - entry_count.to_double * x_mean * y_mean
      end
   
   beta_1_denominator : DOUBLE is
      local
	 index : INTEGER
      do
	 Result := 0
	 from
	    index := xs.lower
	 until
	    not ( xs.valid_index( index ) and ys.valid_index( index ) )
	 loop
	    Result := Result + xs.item(index) * xs.item( index )
	    index := index + 1
	 end
	 Result := Result - entry_count.to_double * x_mean * x_mean
      end
   
   beta_1 : DOUBLE is
      do
	 Result := beta_1_numerator / beta_1_denominator
      end
   
   beta_0 : DOUBLE is
      do
	 Result := y_mean - beta_1 * x_mean
      end
   
   make is
      do
	 !!xs.make
	 !!ys.make
      end
   
   reset is
      do
	 xs.clear
	 ys.clear
      end
   
invariant
   
   xs.count = ys.count
   
end -- PAIRED_NUMBER_LIST


readable_paired_number_list.e

class READABLE_PAIRED_NUMBER_LIST
   
inherit
   SIMPLE_INPUT_PARSER
      redefine
	 parse_last_line
      end
   
   PAIRED_NUMBER_LIST
      redefine
	 make
      end
   
creation
   make

feature {ANY}
   
   make is
      do
	 Precursor
      end
	 
   parse_last_line is
	 --read comma-separated numbers
      local
	 comma_index : INTEGER
	 x_string : STRING
	 y_string : STRING
	 new_x : INTEGER
	 new_y : INTEGER
      do
	 clear_error_flag
	 comma_index := last_line.index_of( ',' )
	 check_for_error( comma_index = last_line.count + 1, "No comma" )
	 x_string := last_line.substring( 1, comma_index - 1 )
	 y_string := last_line.substring( comma_index + 1, last_line.count )
	 check_for_error( not ( x_string.is_double or x_string.is_integer), "invalid X" )
	 check_for_error( not (y_string.is_double or y_string.is_integer), "invalid Y" )
	 if not error_flag then
	    add_entry( double_from_string( x_string ), double_from_string( y_string ) )
	 end
      end
   
   double_from_string ( string : STRING ) : DOUBLE is
      require
	 string.is_double or string.is_integer
      do
	 if string.is_double then
	    Result := string.to_double
	 elseif string.is_integer then
	    Result := string.to_integer.to_double
	 end
      end
   
   error_flag : BOOLEAN
   
   clear_error_flag is
      do
	 error_flag := false
      end
   
   check_for_error( condition : BOOLEAN; message : STRING ) is
      do
	 if condition then
	    error_flag := true
	    report_error( message )
	 end
      end
   
   report_error( message : STRING ) is
      do
	 std_error.put_string( message + ":" + last_line + "%N" )
      end
   
end

                                                        

main.e

class MAIN

creation
   make
   
feature { ANY }
   
   make is
      local
	 number_list : READABLE_PAIRED_NUMBER_LIST
      do
	 !!number_list.make
	 number_list.set_input( io )
	 number_list.parse_until_eof
	 --now output results
	 if ( number_list.count > 0 ) then
	    std_output.put_string( "Beta 0: " )
	    std_output.put_double( number_list.beta_0 )
	    std_output.put_string( "%NBeta 1: " )
	    std_output.put_double( number_list.beta_1 )
	    std_output.put_string( "%N" )
	 else
	    std_output.put_string( "Not enough entries for calculations" )   
	 end --if
      end
   
end-- MAIN

Compile

Once again, I learn my lessons on syntax errors and proper use of components. After some time, and some searching through the header files, I finally found the "c_str()" function for C++ standard library strings; I had erroneously been using data(), which is not necessarily terminated. A good book probably would have shown this to me (the header files are not very helpful, and C++ templates are rather difficult to understand).

While I did have some minor problems with the Eiffel standard library, the HTML documentation generated by the short utility has been very readable and extremely useful. While some standard Eiffel library features could be more useful (for example, class STRING comes with a split feature-- but only splits on a set of defined separators, not including the comma), it's very solid and readable.

Test

Few problems here. The few which existed came from either poor use of libraries or my own error misunderstanding the linear regression algorithm.

Table 4-2. Test Results Format-- Program 4A (results essentially identical for Eiffel and C++ versions)

TestExpected B0Expected B1Actual B0Actual B1
Table D8: Estimated Class vs Actual New and Changed LOC-22.551.7279-22.55251.72793
Table D8: Estimated New and Changed LOC vs Actual New and Changed LOC-23.921.4310-23.92391.42097
Programs 2A, 3A, 4A: Estimated New and Changed LOC vs Actual New and Changed LOC  850.905-5.32154

Incidentally, the correct beta values for my own code should have shifted my new/changed estimate from 143 LOC to a ludicrous 569 LOC. I think that lessons 2 and 3 probably warped things quite a bit, and now wish that I was not including them in my data.

Postmortem

PSP1 Project Plan Summary

Table 4-3. Project Plan Summary

Student:Victor B. PutzDate:000106
Program:Regression Beta AnalyzerProgram#4A
Instructor:WellsLanguage:C++
Summary LOC/HourPlanActualTo date
 ?4141
Program SizePlanActualTo date
Base2828 
Deleted06 
Modified44 
Added139129 
Reused106 8787
Total New and Changed143129686
Total LOC273236966
Total new/reused000
Time in Phase (min):PlanActualTo DateTo Date%
Planning77911614
Design13107910
Code354122627
Compile89526
Test473328735
Postmortem1016668
Total120188826100
Defects Injected ActualTo DateTo Date %
Plan 000
Design 62126
Code 95568
Compile 033
Test 033
Total development 1582100
Defects Removed ActualTo DateTo Date %
Planning 000
Design 000
Code 41923
Compile 93543
Test 22834
Total development 1582100
After Development 00 
Eiffel code/compile/test
Time in Phase (min)ActualTo DateTo Date %
Code2811543
Compile76324
Test158933
Total50267100
Defects InjectedActualTo DateTo Date %
Design048
Code74590
Compile000
Test112
Total850100
Defects RemovedActualTo DateTo Date %
Code012
Compile42856
Test42142
Total850100

Time Recording Log

Table 4-4. Time Recording Log

Student:Victor B. PutzDate:000106
  Program:4A
StartStopInterruption TimeDelta timePhaseComments
000106 17:18:21000106 18:38:07079plan 
000107 10:12:58000107 10:23:24010design 
000107 10:29:27000107 11:10:40041code 
000107 11:12:34000107 11:21:4409compile 
000107 11:21:50000107 11:55:36033test 
000107 11:56:10000107 12:12:38016postmortem 
      

Table 4-5. Time Recording Log

Student:Victor PutzDate:000107
  Program:4A
StartStopInterruption TimeDelta timePhaseComments
000107 12:17:13000107 12:44:52027code 
000107 12:44:56000107 12:51:3506compile 
000107 12:51:48000107 13:06:41014test 
      

Defect Reporting Logs

Table 4-6. Defect Recording Log

Student:Victor B. PutzDate:000106
  Program:4A
Defect foundTypeReasonPhase InjectedPhase RemovedFix timeComments
000107 10:34:11icomdesigncode1forgot to design a constructor
000107 10:58:03icomdesigncode3no line counter in new i/o section; added
000107 11:01:22icomdesigncode1Added error reporting
000107 11:04:12icomdesigncode2added error flag for more beneficial error reporting
000107 11:12:37wntycodecompile0mistyped "readable_paired_number_list"
000107 11:14:00sytycodecompile0missed semicolon at end of class declaration
000107 11:14:45wntycodecompile0typed a_liste instead of a_list
000107 11:15:24syomcodecompile0forgot const on end of method implementation
000107 11:16:11syomcodecompile0forgot to #include required file
000107 11:17:18syigcodecompile1could not declare two iterators in for clause; moved declarations out
000107 11:19:43sytycodecompile0forgot parentheses around no-argument feature
000107 11:20:14wnomcodecompile0Forgot to prepend class name to clear_error method
000107 11:21:17syomcodecompile0forgot to declare constructor in class declaration (only added body in cpp file)
000107 11:25:49wnigdesigntest5was using data() instead of c_str() to get c-readable string from the std::string class.
000107 11:35:00waigdesigntest15Incorrectly assumed that the summation symbol applied to the whole equation, not just the first part.
       

Table 4-7. Defect Recording Log

Student:Victor PutzDate:000107
  Program:4A
Defect foundTypeReasonPhase InjectedPhase RemovedFix timeComments
000107 12:45:00icomcodecompile1forgot to add count feature
000107 12:47:02syomcodecompile0forgot to add "Redefine make" clause
000107 12:48:00wnigcodecompile0used wipe_out instead of clear to empty linked list
000107 12:50:38mcigcodecompile0forgot to call precursor in make
000107 12:53:05ivigcodetest0forgot-- Eiffel strings start at 1, not 0
000107 12:54:58waigcodetest3forgot that to_double doesn't work with straight integers; added double_from_string feature
000107 12:58:43wntycodetest4accidentally typed ys instead of xs
000107 13:04:37sytytesttest0test data did not have a terminating newline after the last data entry