** Chapter 3 ** for W.H. Greene, Econometric Analysis 6th ed. ******************
* (c) Noel Roy 2003, 2008
*
*                            LEAST SQUARES
*
* The tutorial for this chapter will review the following tasks in SHAZAM:
*   •  Reading data files (READ command)
*   •  Setting sample size (SAMPLE and TIME commands)
*   •  Creating variables (GENR command)
*   •  Generating a time trend (TIME function)
*   •  Generating lagged variables (LAG function)
*   •  Producing descriptive statistics (STAT command)
*   •  Correlation coefficients (STAT / PCOR)
*   •  Ordinary Least Squares regression (OLS command)
*   •  Analysis of Variance (OLS / ANOVA)
*
* You may run this command file in SHAZAM in order to replicate the examples
* in Chapter 3 of Greene's textbook.
*
*===============================================================================
* 
* A line preceded by an asterisk is ignored by SHAZAM. Such lines will be 
* used as comment lines to briefly describe the command and options as they
* are initially introduced to the user. For a complete description of these
* and other options, see the SHAZAM User's Reference Manual.
*
*===============================================================================
*
* READING DATA FILES

*Most of the data files which will be used in these tutorials consist of
*plaintext data, delimited by spaces, with observations for all variables
*on separate lines. Normally, the first line will consist of a header line
*containing variable names. Such files can be read very easily using the
*READ command with the NAMES option as follows:
*
*           READ(filename) /NAMES
*
*The NAMES option indicates that the variable names in the header line
*should be used (the forward slash symbol / is used to introduce options). 
*Variable names in SHAZAM may be up to 8 characters long and must consist 
*only of letters or numbers and start with a letter. Where the data file
*comes without variable names, or where some of these names may not be
*suitable, the variable names must be specified by the READ command
*as in READ (filename) varnames.
*
*When the data are not in a separate file, the data directly follow the READ command.
*
*The filename can be any legal name in the current directory. If the file is
*not in the current directory, the complete pathname must be given. The current
*directory can be determined by giving the FILE PWD command, and can be changed
*by the FILE CD folder command, where folder is the name of the new directory.
*
FILE PWD
*
* In the Professional Edition, the default folder for files can be set through
* the Project/Options... menu item. 
*
*While we will not be using this feature, the READ command can also read 
*data from an Excel spreadsheet (.XLS extension), if the XLS option
*is given after the READ command. The first row of the spreadsheet
*MUST contain variable names.
*
*While any filename can be used with a READ command, the Data Editor in
*the Professional Edition will import space delimited data only from files
*with a .PRN extension. While we will not be using the Data Editor in
*these tutorials, its use has some advantages (see Chapter 45 of the
*SHAZAM Manual for further information), so where the formatting of the
*file permits, we have renamed data files to have a .PRN extension to
*permit its use.
*
*Further information about the READ command can be obtained from 
*Chapter 3 of the SHAZAM Manual.
*
*===============================================================================
*
* 3.2 LEAST SQUARES REGRESSION
*
* 3.2.2 (p. 22) Application: an Investment Equation
*
* First attempt to replicate Table 3.1 in the textbook, using the raw data from
* Data Table F3.1 (see p. 947). We begin by reading the data file.
*
READ (TableF3-1.prn) / NAMES LIST
*
* If the current sample has not been set (which is the case here), SHAZAM
*reads the data to the end of the file, then sets the sample implicitly
*in accordance with the number of observations that have been read.
*In this case, the sample consists of 15 yearly observations, so each observation 
*is denoted by a number from 1 through 15. The LIST option lists the data, which 
* can be useful with small datasets to confirm that the data are being read 
* correctly.
*
* Now replicate Table 3.1. First, convert the nominal GNP and Invest variables 
* to real terms (deflating by CPI) and scale them so they are measured in 
* trillions (not hundreds of billions) of dollars. New variables are created
* using the GENR (for GENeRate) command, as in 
*
GENR Y=Invest/CPI/10
GENR G=GNP/CPI/10
*
* The GENR command has the format GENR newvar=expression, where expression 
* is an arithmetic expression involving existing variables, constants, and
* mathematical functions. The command is described further in Chapter 6
* of the SHAZAM manual.
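*
* As a small illustration of the expression syntax (a sketch only; LOGG and
* GSQ are hypothetical names that are not used elsewhere in this tutorial),
* the natural logarithm and the square of G can be generated and inspected
* with
*
GENR LOGG=LOG(G)
GENR GSQ=G**2
PRINT G LOGG GSQ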
*
* The GENR command supports a number of special functions. The TIME(x) function
* generates a time trend beginning at value x+1. So the trend variable in Table
* 3.1 can be generated by
*
GENR T=TIME(0)
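*
* Because the trend starts at x+1, the same function can rebuild a calendar
* year index. The sketch below assumes that the first observation is 1968,
* so that TIME(1967) should reproduce the Year variable; YR is a
* hypothetical name used only for this check.
*
GENR YR=TIME(1967)
PRINT Year YR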
*
* Also generate a variable for the inflation rate (percentage rate of change
* in CPI). The LAG(x,n) function with the GENR command lags the variable x by
* n time periods. If n is omitted, the series is lagged one period.
* Notice, however, that the value of this function is undefined for the first
* observation, since we have no data for the preceding year. When this 
* happens, SHAZAM inserts -99999 (by default) for the missing observation.
* This default value can be modified by the SET MISSV= command.
*
GENR CPILAG=LAG(CPI)
PRINT Year CPILAG
*
* Appendix F tells us that the CPI for 1967 is 79.06. We can include this by
* setting the sample to the first observation only, changing the value of
* CPILAG to this number, and then resetting the sample. This is accomplished
* through the SAMPLE command, which specifies the beginning and ending
* observations for subsequent commands.
*
SAMPLE 1 1
GENR CPILAG=79.06
SAMPLE 1 15
PRINT Year CPILAG
*
* Now generate the Inflation Rate variable.
* 
GENR P=((CPI/CPILAG)-1)*100
*
* Without information for CPI 1967, we would have had to discard the first 
* observation (using the command SAMPLE 2 15) whenever we used the variable P.
* IT IS A COMMON ERROR TO FORGET TO DO THIS.
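*
* A sketch of that pattern, for the hypothetical case in which the first
* value of P is missing: restrict the sample before any command that uses
* P, then restore the full sample afterwards. It is shown as a comment
* because it is not needed here.
*
*     SAMPLE 2 15
*     OLS Y T G Interest P
*     SAMPLE 1 15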
*
* Print Table 3.1.
*
PRINT Y T G Interest P
*
* The regression results can be generated by the OLS command. The OLS command 
* performs Ordinary Least Squares regressions where the first variable listed 
* is the dependent variable. For example, the results on p. 23 can be obtained
* by
*
OLS Y T G
*
* Note that SHAZAM by default estimates a constant term in the regression, and
* there is no need to explicitly allow for it in the OLS command. However,
* unlike most presentations, SHAZAM reports the constant term last.
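*
* If a regression through the origin were wanted instead, the constant
* could be suppressed with the NOCONSTANT option. The line below is shown
* for syntax only; it is not one of the textbook's regressions.
*
*     OLS Y T G / NOCONSTANT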
*
* The results on the top of p. 25 can be obtained by the command
*
OLS Y T G Interest P
*
* The results produced by SHAZAM do not exactly replicate the results in the
* textbook. This is because the textbook bases its calculations on the rounded
* data in Table 3.1, while SHAZAM uses full-precision results from the raw data.
*
*
*===============================================================================
*
* 3.4 PARTIAL REGRESSION AND PARTIAL CORRELATION COEFFICIENTS
*
* Example 3.1 (p. 31): Partial Correlations
*
* The simple correlation coefficients can be obtained from the STAT
* command. The STAT command computes means, standard deviations, variances,
* minima, and maxima for the variables listed. The PCOV and PCOR options
* print the covariance and correlation matrices, respectively, of the
* listed variables. The simple correlations between investment Y and the
* four regressors form the first column of the correlation matrix
* generated by the command
*
STAT Y T G Interest P / PCOR
*
* The partial correlation coefficients are printed with the results of the OLS
* command (in the sixth column).
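*
* As a check on that column, each partial correlation is tied to the
* t-ratio of the same coefficient by the standard identity
*
*     (partial correlation)^2 = t^2 / (t^2 + degrees of freedom)
*
* with the partial correlation taking the sign of the coefficient. For the
* regression of Y on T, G, Interest and P, the degrees of freedom are
* n - K = 15 - 5 = 10.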
*
*
*===============================================================================
*
* 3.6 GOODNESS OF FIT AND ANALYSIS OF VARIANCE
*
* Example 3.3 (pp. 34-35) Analysis of Variance for an Investment Equation
*
* The OLS command automatically reports the coefficient of determination as
* R-SQUARE, together with R-SQUARE ADJUSTED (see p. 35) and the sum of
* squared errors (SSE). It does not produce an Analysis of Variance table
* unless the ANOVA option is used, as in
*
OLS Y T G Interest P /ANOVA
*
* The Amemiya Prediction Criterion PC (see p. 37) and a number of alternative
* model selection criteria are also calculated.
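*
* As a cross-check, the reported fit statistics can be rebuilt from the
* sums of squares. The sketch below assumes that the temporary variables
* $SSE, $SST, $N and $DF left by the preceding OLS command hold the error
* sum of squares, the total sum of squares about the mean, the number of
* observations, and the degrees of freedom n-K (see the SHAZAM manual for
* the list of temporary variables). R2CHK and ADJCHK are hypothetical
* names. The commands would have to follow the OLS command directly, since
* the next estimation overwrites these values.
*
*     GEN1 R2CHK=1-$SSE/$SST
*     GEN1 ADJCHK=1-(1-R2CHK)*($N-1)/$DF
*     PRINT R2CHK ADJCHK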
*
*
* Example 3.2 (p. 34) Fit of a Consumption Function
*
* This example is based on the Consumption (C) and disposable income (X) data
* in Table F2.1.
*
* Since the investment data are no longer needed, it is good practice to delete
* them from the workspace.
*
DELETE /ALL
* 
* When referencing specific years in a time series, it is often convenient to
* use an alternative form of the SAMPLE command in which the SAMPLE range is 
* specified in dates, not observation numbers. Such a SAMPLE command must be
* preceded by a TIME command which specifies the beginning year and frequency 
* for the time series. The general format is TIME beg freq, where beg specifies 
* the start of the series; freq specifies the frequency (e.g., 1-annual,
* 4-quarterly, 12-monthly -- annual is default). Since Table F2.1 consists of 
* annual data for the period 1940-1950, we can use the commands
*
TIME 1940 1
SAMPLE 1940.0 1950.0
*
* Note the use of the decimal in the SAMPLE command, which is used to indicate
* the quarter or month (as appropriate) of the beginning and ending observations
* but is necessary even with annual data in order to distinguish this form of
* the SAMPLE command from the other form.
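*
* For comparison, a sketch of the same pattern with quarterly data. The
* dates are hypothetical and do not refer to any file used in this
* tutorial: a series running from the first quarter of 1950 through the
* fourth quarter of 1959 would be declared by
*
*     TIME 1950 4
*     SAMPLE 1950.1 1959.4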
*
READ (TableF2-1.prn) Year x y W / SKIPLINES=1 LIST
*
* Note that we have chosen y rather than C as the variable name for
* consumption, in order to use the same notation as in the text. Therefore,
* we must specify the variable names in the READ command. (An alternative
* would have been to use the RENAME command, as in RENAME C y.) Because the
* data file has a header line at the top, which we are not using, SHAZAM
* must be told to skip that line, which is the purpose of the SKIPLINES=
* option.
*
* The STAT command gives sample means of the listed data. The PCPDEV option 
* prints a cross-product matrix of the variables listed in deviations from the 
* means.
*
STAT x y / PCPDEV
*
* Analysis of variance:
*
OLS y x /ANOVA
*
* We can omit the war years 1942-45 from the sample simply by modifying the
* SAMPLE command. Two or more discontinuous intervals can be chained together
* as in
*
SAMPLE 1940.0 1941.0 1946.0 1950.0
OLS y x
*
* Alternatively, we can account for the war years by using the dummy variable W.
*
SAMPLE 1940.0 1950.0
OLS y x W
*
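* A dummy variable of this kind can also be built with GENR, since a
* logical expression evaluates to 1 when true and 0 when false. The sketch
* below assumes that W equals 1 in the war years 1942-45 and 0 otherwise;
* WCHK is a hypothetical name used only to compare against W.
*
GENR WCHK=(Year.GE.1942).AND.(Year.LE.1945)
PRINT Year W WCHK
*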
STOP
*===============================================================================
*
* Updated August 27, 2008.