
Chapter 11 – Multicollinearity

Each of the slope coefficients in a multiple regression model provides a measure of the marginal effect associated with a one-unit change in a particular independent variable, holding constant the effect of all other variables. These individual marginal effects can be estimated, however, only if each independent variable varies in a manner that cannot be fully accounted for by the other independent variables. If the variation in one or more of the independent variables can be fully explained by variation in the other variables, then perfect multicollinearity is said to exist. This chapter begins with a discussion of the consequences of perfect multicollinearity.

Perfect multicollinearity rarely occurs in well specified regression models. In many econometric models, however, there is a substantial correlation between two or more of the independent variables. Many time-series variables, for example, are highly correlated since they share a substantial trend component. High correlations also tend to occur among scores on alternative tests of ability (such as IQ, SAT, and ACT exams). When this correlation is very high, it becomes difficult to determine the separate effects of the individual independent variables. In this case, a multicollinearity problem is said to occur. The latter portion of this chapter deals with the consequences of multicollinearity. Diagnostic tools and alternative remedial strategies are then discussed.

11.1 Perfect multicollinearity

In Chapter 6, the assumptions of the classical regression model were defined. Assumption 6.6 requires that:

Assumption 6.6 None of the independent variables can be written as an exact linear combination of the other independent variables.

If this condition is violated, perfect multicollinearity is said to exist.

11.1.1 Consequences of perfect multicollinearity

When perfect multicollinearity is present, it is not possible to estimate all intercept and slope parameters in the regression model. To see this, let’s examine a simple case.
Suppose that a regression equation is given by: \begin{equation}
Y_i=\beta_0+\beta _1X_{1i}+\beta _2X_{2i}+u_i \tag{11.1} \end{equation}
Let’s suppose that the variable [latex]X_{1i}[/latex] equals a multiple of the variable [latex]X_{2i}[/latex], so that:
\begin{equation}
X_{1i}=cX_{2i} \tag{11.2}
\end{equation}
where [latex]c[/latex] is a constant. To estimate the parameters of equation 11.1 by OLS, an econometrician must find values of [latex]\hat{\beta}_0,\hat{\beta}_1[/latex], and [latex]\hat{\beta}_2[/latex] that solve the normal equations:
\begin{equation}
\sum_{i=1}^N\left( Y_i-\hat{\beta}_0-\hat{\beta}_1X_{1i}-\hat{\beta}_2X_{2i}\right) =0 \tag{11.3}
\end{equation}
\begin{equation}
\sum_{i=1}^NX_{1i}\left( Y_i-\hat{\beta}_0-\hat{\beta}_1X_{1i}-\hat{\beta}_2X_{2i}\right) =0 \tag{11.4}
\end{equation}
\begin{equation}
\sum_{i=1}^NX_{2i}\left( Y_i-\hat{\beta}_0-\hat{\beta}_1X_{1i}-\hat{\beta}_2X_{2i}\right) =0 \tag{11.5}
\end{equation}
These equations provide three linear equations in the unknown parameters [latex]\hat{\beta}_0,\hat{\beta}_1[/latex], and [latex]\hat{\beta}_2[/latex]. Using the relationship in equation 11.2, however, equation 11.4 can be restated as:
\begin{equation*}
\sum_{i=1}^NcX_{2i}\left( Y_i-\hat{\beta}_0-\hat{\beta}_1X_{1i}-\hat{\beta}_2X_{2i}\right) = 0
\end{equation*}
Multiplying both sides of this equation by [latex]1/c[/latex] results in: \begin{equation*}
\sum_{i=1}^NX_{2i}\left( Y_i-\hat{\beta}_0-\hat{\beta}_1X_{1i}-\hat{\beta}_2X_{2i}\right) =0
\end{equation*}
Thus, equations 11.4 and 11.5 are equivalent, and the normal equations provide only two independent equations in the three unknowns [latex]\hat{\beta}_0,\hat{\beta}_1[/latex], and [latex]\hat{\beta}_2[/latex]. In this case, it is impossible to obtain unique solutions for the slope parameters.
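The redundancy among the normal equations can be checked numerically. The following is a minimal sketch in plain Python, using made-up data and two hypothetical helper functions (`cross_product_matrix` and `det3`). It builds the matrix of sums that multiplies the unknown coefficients in equations 11.3 to 11.5 and computes its determinant. A zero determinant means that the three equations do not have a unique solution.

```python
def cross_product_matrix(x1, x2):
    """Matrix of sums multiplying (b0, b1, b2) in the three normal equations."""
    columns = [[1] * len(x1), x1, x2]  # constant, X1, X2
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical data; integers keep the arithmetic exact.
x2 = [1, 2, 3, 4, 5]
c = 3
x1_collinear = [c * value for value in x2]  # X1 = c * X2, as in equation 11.2
x1_independent = [2, 1, 4, 3, 6]            # not a linear function of X2

det_collinear = det3(cross_product_matrix(x1_collinear, x2))
det_independent = det3(cross_product_matrix(x1_independent, x2))

print("determinant when X1 = c * X2:", det_collinear)
print("determinant when X1 varies independently:", det_independent)
```

With the collinear data the determinant is zero, so no unique set of estimates exists. With the second set of values for [latex]X_1[/latex] the determinant is nonzero and the normal equations can be solved.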

The intuition behind this result is quite straightforward. Each of the slope coefficients represents the effect of a one-unit change in the corresponding independent variable, holding constant the effect of all other variables. If [latex]X_1[/latex] is always equal to a linear multiple of [latex]X_2[/latex], then a change in the level of [latex]X_1[/latex] will always be associated with a corresponding change in the level of [latex]X_2[/latex]. In this case, it is impossible to determine how much of the resultant change in the dependent variable is the result of the change in [latex]X_1[/latex] and how much is due to the change in [latex]X_2[/latex]. Thus, it is impossible to estimate the magnitude of the separate effects of each of these variables when they are perfectly collinear.

In many cases, a perfect multicollinearity problem is the result of an error in formulating the regression model. Beginning econometricians often experience this problem when they are working with dummy variables. Suppose, for example, that an econometrician attempts to estimate an earnings equation of the form:
\begin{equation*}
\ln (\text{earnings}_{i})=\beta _{0}+\beta _{1}\text{experience}_{i}+\beta _{2}\text{experience}_{i}^{2}+\beta _{3}\text{education}_{i} \end{equation*}
\begin{equation*}
+\beta _{4}\text{male}_{i}+\beta _{5}\text{female}_{i}+u_{i} \end{equation*}

where:

  • male[latex]_{i}[/latex] = 1 if individual [latex]i[/latex] is male (= 0 otherwise)
  • female[latex]_{i}[/latex] = 1 if individual [latex]i[/latex] is female (= 0 otherwise)

For each observation, one of the gender dummy variables equals one while the other gender dummy variable will equal zero. Since the sum of these two dummy variables equals one for each observation, the constant term is a linear combination of these variables. This is an example of the dummy variable trap discussed in Chapter 9. To estimate a model of this type, one of the gender dummy variables must be excluded from the regression equation.[1] When a regression model includes dummy variables representing race, educational attainment, marital status, seasonal effects, or similar qualitative variables, the dummy variable corresponding to one of the possible outcomes must be excluded from the regression equation.
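The dummy variable trap can be illustrated with a few lines of Python. This is only a sketch: the five-person sample and the coefficient values are hypothetical, and the function `fitted` covers only the intercept and gender terms of the earnings equation. It shows that an arbitrary amount can be moved between the intercept and the two dummy coefficients without changing any fitted value, which is why the three parameters cannot be estimated separately.

```python
# Hypothetical sample of five individuals.
male = [1, 0, 0, 1, 0]
female = [1 - m for m in male]   # female = 1 exactly when male = 0
constant = [1] * len(male)       # the regressor attached to the intercept

# The constant is an exact linear combination of the two dummies.
trap = all(m + f == k for m, f, k in zip(male, female, constant))


def fitted(b0, b_male, b_female):
    """Intercept-plus-gender part of the fitted value for each person."""
    return [b0 + b_male * m + b_female * f for m, f in zip(male, female)]


# Move an arbitrary amount from the dummy coefficients into the intercept.
shift = 2
original = fitted(10, 3, 1)
shifted = fitted(10 + shift, 3 - shift, 1 - shift)

print("constant equals male + female:", trap)
print("fitted values unchanged by the shift:", original == shifted)
```

Dropping one of the dummies removes this indeterminacy: with only the male dummy included, the intercept is the fitted value for females and the male coefficient is the male-female difference.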

11.1.2 Diagnosis of perfect multicollinearity

If you are using a regression software package, the diagnosis of a perfect multicollinearity problem is quite straightforward: the program will refuse to provide regression estimates for all of the model parameters.[2] An error message will generally appear that resembles one of the following:

  • regressors are collinear,
  • matrix is singular,
  • matrix cannot be inverted,
  • perfect multicollinearity detected, or
  • parameter estimates cannot be constructed.

11.1.3 Perfect multicollinearity: remedial measures

While it is impossible to estimate all of the parameters of a regression model when perfect multicollinearity is present, it is possible to estimate some of the parameters. Let’s consider a simple example.

Suppose that the regression relationship is given by: \begin{equation}
Y_i=\beta _0+\beta _1X_{1i}+\beta _2X_{2i}+\beta _3X_{3i}+u_i \tag{11.6} \end{equation}
Let’s assume that the variable [latex]X_{1i}[/latex] is an exact linear multiple of the variable [latex]X_{2i}[/latex], so that:
\begin{equation}
X_{1i}=cX_{2i} \tag{11.7}
\end{equation}
Using the relationship defined in equation 11.7, the regression relationship in equation 11.6 may be restated as: \begin{equation*}
Y_i=\beta _0+\beta _1cX_{2i}+\beta _2X_{2i}+\beta _3X_{3i}+u_i \end{equation*}
This may be simplified to:
\begin{equation}
Y_i=\beta _0+\left( \beta _1c+\beta _2\right) X_{2i}+\beta _3X_{3i}+u_i \tag{11.8} \end{equation}
Defining a new coefficient, [latex]\gamma[/latex], as:
\begin{equation*}
\gamma =\beta _1c+\beta _2
\end{equation*}
equation 11.8 may be expressed as: \begin{equation}
Y_i=\beta _0+\gamma X_{2i}+\beta _3X_{3i}+u_i \tag{11.9}
\end{equation}
The relationship appearing in equation 11.9 can be estimated using an OLS procedure. The estimated parameters [latex]\hat \beta _0[/latex] and [latex]\hat \beta _3[/latex] serve as estimators for the original parameters [latex]\beta _0[/latex] and [latex]\beta _3[/latex]. The estimated parameter [latex]\hat \gamma[/latex] serves as an estimator of [latex]\beta _1c+\beta _2[/latex]. Unfortunately, it is impossible to estimate the individual parameters [latex]\beta _1[/latex] and [latex]\beta _2[/latex] since the separate effects of [latex]X_{1i}[/latex] and [latex]X_{2i}[/latex] cannot be observed. Instead, only the combined effect of these variables can be measured. The parameter [latex]\gamma[/latex] captures this combined effect. In this equation, the variable [latex]X_{1i}[/latex] does not appear since it contains no information that is not already captured by the variable [latex]X_{2i}[/latex]. (Of course, it would also have been possible to respecify the model in a manner in which [latex]X_{2i}[/latex] is omitted from the equation instead.)

In general, whenever a perfect multicollinearity problem occurs, it is possible to transform the model into one involving fewer regression parameters through the use of the procedure illustrated above. The transformed model may then be estimated using OLS techniques. While it is not possible to estimate all of the parameters of the original model, it is possible to estimate the coefficients multiplying those variables that are not linear functions of the other variables.
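A small numerical sketch may help to fix the idea. To keep the code short, it assumes a simplified version of the model with no [latex]X_3[/latex] term and no error term, and it uses hypothetical parameter values and a hypothetical helper `simple_ols`. The regression of [latex]Y[/latex] on [latex]X_2[/latex] alone recovers the combined coefficient [latex]\beta_1c+\beta_2[/latex], while [latex]\beta_1[/latex] and [latex]\beta_2[/latex] themselves remain unidentified.

```python
def simple_ols(x, y):
    """Intercept and slope from a regression of y on a constant and x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


# Hypothetical parameter values and data (no error term, so the fit is exact).
beta0, beta1, beta2, c = 1.0, 2.0, 3.0, 0.5
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [c * value for value in x2]  # X1 = c * X2
y = [beta0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]

# Regress Y on X2 alone: the slope estimates gamma = beta1 * c + beta2.
beta0_hat, gamma_hat = simple_ols(x2, y)

print("estimated gamma:", gamma_hat)
print("beta1 * c + beta2:", beta1 * c + beta2)
```

Any other pair of values for [latex]\beta_1[/latex] and [latex]\beta_2[/latex] with the same value of [latex]\beta_1c+\beta_2[/latex] would have generated exactly the same data, which is why only the combined effect can be estimated.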

11.2 Multicollinearity

Perfect multicollinearity is a relatively rare occurrence in properly specified econometric models. It is most likely to occur when an econometrician has failed to take linear relationships among the variables into account (as in the dummy variable trap). Quite often, however, there will be a high degree of linear association among two or more independent variables. In this case, an approximate linear relationship exists among two or more of the independent variables in the regression equation. For example, it is possible that relationships such as the following might exist:
\begin{equation}
X_{1i}\approx cX_{2i} \tag{11.10}
\end{equation}
\begin{equation}
X_{1i}\approx aX_{3i}+bX_{4i}+cX_{5i} \tag{11.11} \end{equation}
In cases such as these, near-perfect multicollinearity is said to occur. This situation is also simply called a multicollinearity problem.

11.2.1 Consequences of multicollinearity

When near-perfect multicollinearity occurs, one of the variables is approximately equal to a linear combination of the other variables. In equations 11.10 and 11.11 above, [latex]X_{1i}[/latex] is approximately equal to a linear combination of other independent variables. When this occurs, there is little variation in [latex]X_1[/latex] that is independent of the other independent variables. Consider, for example, the relationship appearing in equation 11.10. Changes in the level of [latex]X_{2i}[/latex] will always tend to be associated with changes in the level of [latex]X_{1i}[/latex]. Under these circumstances, it is difficult to determine the separate effects of [latex]X_1[/latex] and [latex]X_2[/latex] (since the two variables tend to change together). Estimates of the coefficients multiplying the affected variables will be unstable. This will generally result in high standard errors for the estimated coefficients on the affected variables.

A multicollinearity problem should be suspected under the following circumstances:

  • high standard errors (often resulting in low [latex]t[/latex]-ratios) for parameters corresponding to the affected variables,
  • estimates are very sensitive to small changes in the data,
  • unexpected signs on estimated coefficients and/or unreasonable magnitudes,
  • a high R[latex]^{2}[/latex],
  • a significant joint effect of the collinear variables (as measured by an [latex]F[/latex] test for the joint significance of the variables), and
  • high correlations between two or more of the independent variables.

11.2.2 Diagnosis of multicollinearity

If some or all of these conditions hold, it is desirable to test for the presence of multicollinearity. A multicollinearity problem should always be suspected when the [latex]t[/latex]-ratio for each of the independent variables is insignificant, yet an [latex]F[/latex] test indicates that the slope coefficients are jointly significant. When a multicollinearity problem is suspected, a number of tests may be performed for verification purposes. These tests include:

  • an examination of the pairwise correlations among the independent variables;
  • the use of [latex]F[/latex] tests to assess the joint significance of variables that are individually insignificant;
  • the estimation of auxiliary regressions to determine the degree of linear relationship existing among the independent variables; and
  • an examination of the “condition number” for the regression.

Let’s examine each of these techniques.

11.2.3 Pairwise correlations

A first step in detecting a multicollinearity problem involves an inspection of the pairwise correlations among the independent variables. If any two variables have a correlation that is close to one in absolute value, then a multicollinearity problem should definitely be suspected. More generally, pairwise correlations above 0.9 may suggest the presence of multicollinearity. Of course, if the multicollinearity relationship exists among three or more variables (as in equation 11.11), the pairwise correlations might be low even when multicollinearity exists. Thus, high pairwise correlations suggest the presence of multicollinearity, but low pairwise correlations do not rule out the possibility of multicollinearity.
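The limitation of this screen can be seen in a small constructed example. The sketch below uses hypothetical data in which [latex]X_1[/latex] is almost exactly equal to [latex]X_2+X_3[/latex], together with a hypothetical helper `correlation`, and applies the 0.9 rule of thumb to every pair of variables.

```python
from itertools import combinations
from math import sqrt


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical regressors: X1 is almost exactly X2 + X3.
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
x3 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1]
x1 = [a + b + e for a, b, e in zip(x2, x3, noise)]

variables = {"X1": x1, "X2": x2, "X3": x3}
pairwise = {
    (name_a, name_b): correlation(variables[name_a], variables[name_b])
    for name_a, name_b in combinations(variables, 2)
}

# Apply the 0.9 rule of thumb to the absolute correlations.
flagged = [pair for pair, r in pairwise.items() if abs(r) > 0.9]

for pair, r in pairwise.items():
    print(pair, round(r, 4))
print("pairs above 0.9:", flagged)
```

No pair is flagged in this example, even though one variable is nearly an exact linear combination of the other two. A check that looks at all of the independent variables jointly is needed to detect a relationship of this kind.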

11.2.4 Test of joint significance

Since a multicollinearity problem will often lower the [latex]t[/latex]-ratios for the affected variables, it is useful to conduct a test of the joint significance of these variables when multicollinearity is suspected. Consider the equation:
\begin{equation*}
Y_i=\beta _0+\beta _1X_{1i}+\beta _2X_{2i}+\beta _3X_{3i}+u_i \end{equation*}
Suppose that the estimated parameters [latex]\hat{\beta}_2[/latex] and [latex]\hat{\beta}_3[/latex] have relatively low [latex]t[/latex]-ratios. If it is believed that these low [latex]t[/latex]-ratios are the result of a high correlation existing between these two variables, then an [latex]F[/latex] test can be conducted to test the hypothesis:

H[latex]_0[/latex]: [latex]\beta _2=\beta _3=0[/latex]

If this hypothesis can be rejected, then it is quite possible that the low [latex]t[/latex]-ratios are the result of a multicollinearity problem.

In general, if it is believed that low [latex]t[/latex]-ratios for a set of variables are due to a multicollinearity relationship existing among these variables, an [latex]F[/latex] test can be performed to examine the joint effect of the affected variables.
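As a reminder of the mechanics, the sketch below computes the usual [latex]F[/latex] statistic for a set of exclusion restrictions from the residual sums of squares of the restricted and unrestricted regressions. The function name, its argument names, and the numerical inputs are hypothetical. It is assumed that the statistic is defined in the standard way, with the number of restrictions in the numerator and the unrestricted degrees of freedom in the denominator.

```python
def joint_f_statistic(rss_restricted, rss_unrestricted, num_restrictions, n_obs, num_slopes):
    """F statistic for H0 that a subset of slope coefficients all equal zero.

    rss_restricted: residual sum of squares with the tested variables dropped.
    rss_unrestricted: residual sum of squares from the full model.
    num_slopes: number of slope coefficients in the full model (intercept excluded).
    """
    numerator = (rss_restricted - rss_unrestricted) / num_restrictions
    denominator = rss_unrestricted / (n_obs - num_slopes - 1)
    return numerator / denominator


# Hypothetical values for a model with three regressors and 30 observations,
# testing H0: beta_2 = beta_3 = 0 (two restrictions).
f_stat = joint_f_statistic(
    rss_restricted=120.0,
    rss_unrestricted=100.0,
    num_restrictions=2,
    n_obs=30,
    num_slopes=3,
)
print("F statistic:", f_stat)
```

The resulting value would be compared with the critical value of the [latex]F[/latex] distribution with 2 and 26 degrees of freedom.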

11.2.5 Auxiliary regressions

If it is suspected that a multicollinearity relationship exists among three or more independent variables, then a simple test for multicollinearity is to estimate auxiliary regressions in which each of the independent variables is specified as a linear function of all of the other independent variables. If the original equation is:
\begin{equation*}
Y_{i}=\beta _{0}+\beta _{1}X_{1i}+\beta _{2}X_{2i}+\ldots +\beta _{k}X_{ki}+u_{i}
\end{equation*}
then there are [latex]k[/latex] auxiliary regression equations: \begin{equation*}
X_{1i}=\alpha _{0}+\alpha _{1}X_{2i}+\alpha _{2}X_{3i}+\ldots +\alpha _{k-1}X_{ki}+v_{i}
\end{equation*}
\begin{equation*}
X_{2i}=\gamma _{0}+\gamma _{1}X_{1i}+\gamma _{2}X_{3i}+\ldots +\gamma _{k-1}X_{ki}+\epsilon _{i}
\end{equation*}
\begin{equation*}
\vdots
\end{equation*}
\begin{equation*}
X_{ki}=\delta _{0}+\delta _{1}X_{1i}+\delta _{2}X_{2i}+\ldots +\delta _{k-1}X_{k-1,i}+\omega _{i}
\end{equation*}
R[latex]_{j}^{2}[/latex] is defined as the R[latex]^{2}[/latex] corresponding to the [latex]j[/latex]th auxiliary regression above.
High values for one or more of the R[latex]_{j}^{2}[/latex] suggest that a multicollinearity problem may be present. A common rule of thumb is that a multicollinearity problem should be suspected if any one of the R[latex]_{j}^{2}[/latex] is greater than the R[latex]^{2}[/latex] for the original equation.
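The auxiliary regressions can be sketched in plain Python. The code below is illustrative only: the data are hypothetical (the first variable is constructed to be almost exactly the sum of the other two), and `solve_linear_system` and `r_squared` are hypothetical helpers that solve the normal equations by Gaussian elimination. Regression software would normally be used for this step.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def r_squared(y, regressors):
    """R-squared from an OLS regression of y on a constant and the regressors."""
    n = len(y)
    columns = [[1.0] * n] + [list(col) for col in regressors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(coef, columns)) for i in range(n)]
    mean_y = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss


# Hypothetical regressors: X1 is almost exactly X2 + X3.
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
x3 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1]
x1 = [a + b + e for a, b, e in zip(x2, x3, noise)]

# One auxiliary regression per variable: regress it on all of the others.
variables = {"X1": x1, "X2": x2, "X3": x3}
aux_r2 = {}
for name, values in variables.items():
    others = [col for other, col in variables.items() if other != name]
    aux_r2[name] = r_squared(values, others)

for name, value in aux_r2.items():
    print(name, round(value, 4))
```

In this constructed example every auxiliary regression has a very high R[latex]_{j}^{2}[/latex], which reflects the near-exact linear relationship among the three variables.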

An alternative measure of multicollinearity based upon these auxiliary regressions is the variance inflation factor (VIF) associated with a given independent variable. The variance inflation factor associated with the [latex]j[/latex]th independent variable is defined as: \begin{equation}
\text{VIF(}\hat{\beta}_{j}\text{)}=\frac{1}{1-\text{R}_{j}^{2}} \tag{11.12}
\end{equation}
An inspection of the definition appearing in equation 11.12 indicates that the variance inflation factor will be relatively high when R[latex]_{j}^{2}[/latex] is relatively large. If the variable [latex]X_{j}[/latex] is uncorrelated with the other independent variables, then R[latex]_{j}^{2}[/latex] will equal zero and VIF([latex]\hat{\beta}_{j}[/latex]) will equal one. As R[latex]_{j}^{2}[/latex] approaches one, however, the corresponding variance inflation factor tends toward infinity. Roughly speaking, the variance inflation factor provides a measure of the effect of multicollinearity on the variance of the parameter estimates. A large value for this measure indicates that the particular variable is highly correlated with the other independent variables.

R[latex]_{j}^{2}[/latex] and VIF([latex]\hat{\beta}_{j}[/latex]) provide alternative (and equivalent) methods of detecting which particular variables are the source of a multicollinearity problem. If an econometrician is willing to say that an R[latex]_{j}^{2}[/latex] that exceeds 0.9 constitutes a multicollinearity problem, then (using the definition in equation 11.12) the corresponding threshold for the variance inflation factor is: \begin{equation*}
\text{VIF(}\hat{\beta}_{j}\text{)}=\frac{1}{1-0.9}=10 \end{equation*}
Unfortunately, however, there is no hard and fast rule that can be used to evaluate the extent of a multicollinearity problem. Even if all of the auxiliary regressions have R[latex]_{j}^{2}[/latex] that are greater than or equal to 0.9, an OLS estimation procedure may still generate “reasonable” parameter estimates and significant [latex]t[/latex]-statistics for relevant variables.

11.2.6 Condition number

An alternative test for multicollinearity is based upon the condition number for the set of independent variables.[3] The condition number measures the extent of the linear relationships that exist among all of the independent variables. Larger values of the condition number indicate a higher degree of correlation among the independent variables. Belsley, Kuh, and Welsch, the developers of this index, suggest that a condition number that is greater than 20 indicates the potential for a multicollinearity problem. It is generally agreed that a condition number that is greater than 30 indicates the possibility of a severe multicollinearity problem. The condition number may be automatically computed by SAS and several other regression packages.[4]

Longley data

Most early econometric studies relied on calculations performed on mechanical calculators. The high computational costs involved in regression analysis substantially limited the number of variables that could be included in these regression models. As economists and statisticians acquired access to mainframe computing facilities during the late 1950s and early 1960s, the size of models that could be estimated expanded dramatically.

As Longley (1967) notes, however, the first least-squares computer algorithms used on these computers were often taken directly from computational procedures that were designed to lower the cost of manual computation on mechanical calculators. These computational procedures, however, often resulted in relatively large computational errors when they were applied to larger regression models.

To address this issue, Longley collected a set of independent variables that exhibited a high degree of multicollinearity. A desk calculator was used to estimate a total of 165 equations using these independent variables (and a collection of 8 dependent variables). The accuracy of a variety of contemporary regression packages was then assessed against these benchmarks. Most of the commonly used packages provided only 1-5 digit accuracy in the parameter estimates.

Longley’s critique resulted in substantial improvements in the quality of the computer algorithms used in regression analysis. His data and benchmark computations are still widely used to verify the accuracy of regression software.

11.2.7 Example: Longley data

The data used by Longley (1967) to test the accuracy of some of the first regression analysis software packages[5] is known for exhibiting large multicollinearity problems. Let’s examine one of the equations estimated by Longley:
\begin{equation}
\widehat{\text{total\ employment}}_{t}=\underset{(-3.905)}{-3482260}+\underset{(4.010)}{1829.2}\text{Year}_{t}+\underset{(0.177)}{15.062}\text{GNP deflator}_{t} \tag{11.13}
\end{equation}
\begin{equation*}
-\underset{(-1.068)}{0.03582}\text{GNP}_{t}-\underset{(-4.131)}{2.0202}\text{Unemployment}_{t}
\end{equation*}
\begin{equation*}
-\underset{(-4.815)}{1.0332}\text{Armed forces}_{t}-\underset{(-0.226)}{0.0511}\text{Population}_{t}
\end{equation*}

R[latex]^{2}[/latex] = 0.9955
([latex]t[/latex]-ratios in parentheses)

Note that the estimated coefficients on GNP[latex]_{t}[/latex] and Population[latex]_{t}[/latex] are insignificant and have an unexpected negative sign. It is quite possible that these results may be due to multicollinearity. Let’s examine this possibility.

Table 11.1: Pairwise correlations for Longley data
Variable      Year          GNP deflator  GNP           Unemployment  Armed Forces  Population
Year          1.0000
GNP deflator  0.9911        1.0000
GNP           0.9953        0.9916        1.0000
Unemployment  0.6683        0.6206        0.6043        1.0000
Armed Forces  0.4172        0.4647        0.4464        -0.1774       1.0000
Population    0.9940        0.9792        0.9911        0.6866        0.3644        1.0000

Table 11.1 contains a listing of the pairwise correlations for the Longley data. The pairwise correlations among the independent variables range from a low of -0.1774 to a high of 0.9953. Six of the fifteen pairwise correlations (those among Year, GNP deflator, GNP, and Population) exceed 0.97. These high correlations provide some tentative evidence of a multicollinearity problem. Further evidence is provided by the set of auxiliary regressions. The R[latex]_{j}^{2}[/latex] and variance inflation factors for these regressions are:

  • R[latex]_{1}^{2}=0.99868[/latex], VIF[latex](\hat{\beta}_{1})=\frac{1}{1-0.99868}=757.6[/latex]
  • R[latex]_{2}^{2}=0.99262[/latex], VIF[latex](\hat{\beta}_{2})=\frac{1}{1-0.99262}=135.5[/latex]
  • R[latex]_{3}^{2}=0.99944[/latex], VIF[latex](\hat{\beta}_{3})=\frac{1}{1-0.99944}=1785.7[/latex]
  • R[latex]_{4}^{2}=0.97025[/latex], VIF[latex](\hat{\beta}_{4})=\frac{1}{1-0.97025}=33.6[/latex]
  • R[latex]_{5}^{2}=0.72136[/latex], VIF[latex](\hat{\beta}_{5})=\frac{1}{1-0.72136}=3.6[/latex]
  • R[latex]_{6}^{2}=0.99748[/latex], VIF[latex](\hat{\beta}_{6})=\frac{1}{1-0.99748}=396.8[/latex]

Note that three of these six auxiliary regressions have R[latex]_{j}^{2}[/latex] that exceed the R[latex]^{2}[/latex] in the original regression. Four of these R[latex]_{j}^{2}[/latex] are greater than 0.99. The high values for the R[latex]_{j}^{2}[/latex] and the large variance inflation factors for most variables are consistent with a multicollinearity problem.
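The arithmetic behind these comparisons is easy to reproduce. The short sketch below takes the auxiliary R[latex]_{j}^{2}[/latex] values and the original R[latex]^{2}[/latex] exactly as reported above and applies equation 11.12. The function name `variance_inflation_factor` is a hypothetical helper, not part of any particular package.

```python
def variance_inflation_factor(r2_aux):
    """VIF implied by the R-squared of an auxiliary regression (equation 11.12)."""
    return 1.0 / (1.0 - r2_aux)


# Auxiliary R-squared values and the original R-squared, as reported in the text.
aux_r2 = [0.99868, 0.99262, 0.99944, 0.97025, 0.72136, 0.99748]
original_r2 = 0.9955

vifs = [round(variance_inflation_factor(r2), 1) for r2 in aux_r2]
exceed_original = sum(r2 > original_r2 for r2 in aux_r2)
above_099 = sum(r2 > 0.99 for r2 in aux_r2)

print("VIFs:", vifs)
print("auxiliary R-squared values above the original R-squared:", exceed_original)
print("auxiliary R-squared values above 0.99:", above_099)
```

Only the fifth variable has a variance inflation factor below the threshold of 10 that corresponds to an R[latex]_{j}^{2}[/latex] of 0.9.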

The condition number for this regression is equal to 43275. Since this value is dramatically larger than 30 (the value generally taken to indicate the possibility of a severe multicollinearity problem), the condition number provides further strong evidence of the presence of multicollinearity.

This evidence strongly suggests the presence of a multicollinearity problem in the Longley data.

11.2.8 Remedial measures

When a multicollinearity problem is present, econometricians are often disappointed to see low [latex]t[/latex]-ratios on variables that they strongly believe are important determinants of the dependent variable. Three methods of dealing with this problem are often suggested:[6]

  • acquire more data,
  • drop one or more of the affected variables from the regression equation, and/or
  • reformulate the model so that the variables in the transformed model exhibit a less severe multicollinearity problem.

Let’s examine each of these possible “remedies.”

11.2.9 Acquiring additional data

An increase in the size of the sample tends to reduce the variance of the estimators. If more observations are available, the standard errors of the estimators will tend to decline. In practice, however, econometricians generally use all readily available observations when they estimate regression models. Since multicollinearity problems are most often found in time-series models, this strategy is often impossible. For many time-series variables, only a limited number of years of data have been recorded by government agencies. Few econometricians are willing to wait several years for more time series observations to become available. Even if additional data is available, however, there is no guarantee that the multicollinearity problem will be less severe when the sample size is increased. In general, however, this strategy should be attempted whenever it is feasible.

11.2.10 Dropping variables from the regression model

Since dropping a variable makes it possible to estimate some of the parameters of a regression model when perfect multicollinearity is present, this is sometimes suggested as a remedy to a multicollinearity problem. Unfortunately, however, there is a major problem with this procedure. If a variable is dropped from an equation in which it belongs, the estimates of the intercept and slope coefficients are subject to a potential omitted variable bias. The magnitude of this bias will, in general, be greater for estimated coefficients on those variables that are highly correlated with the dropped variable.

Econometricians generally prefer to include all theoretically important variables in a regression equation, even if a multicollinearity problem may occur. Consider, for example, the consumption function model given by: \begin{equation}
\text{C}_{t}\text{ = }\beta _{0}+\beta _{1}\text{YD}_{t}+\beta _{2}\text{Wealth}_{t}+u_{t} \tag{11.14}
\end{equation}
where:

  • C[latex]_{t}[/latex] = consumption in year [latex]t[/latex]
  • YD[latex]_{t}[/latex] = disposable personal income in year [latex]t[/latex]
  • Wealth[latex]_{t}[/latex] = household wealth in year [latex]t[/latex]

From a theoretical standpoint, it is expected that consumption expenditures will be affected by both current disposable personal income and the level of household wealth. In practice, though, it is possible that there will be a high degree of correlation between the disposable personal income and wealth variables. An econometrician, though, is likely to include both of these independent variables in the regression equation even if a multicollinearity problem exists since both income and wealth are theoretically important variables.[7] A common way of dealing with such situations is to report the estimated equation and describe the evidence suggesting the presence of multicollinearity.

11.2.11 Model reformulations

First-differences

Since many economic time series contain a significant trend component, many time-series variables tend to be highly correlated. Thus, multicollinearity is a common problem in time-series regression models. One possible solution to the multicollinearity problem involves transforming the model in a manner that may reduce the extent of the multicollinearity problem. Suppose, for example, that a time-series regression model is given by: \begin{equation}
Y_{t}=\beta _{0}+\beta _{1}X_{t}+\beta _{2}Z_{t}+u_{t} \tag{11.15} \end{equation}
If the variables [latex]X_{t}[/latex] and [latex]Z_{t}[/latex] both tend to grow over time (as is true of many time-series variables), the levels of [latex]X_{t}[/latex] and [latex]Z_{t}[/latex] may be highly correlated and a multicollinearity problem may result.

Note that if equation 11.15 holds in each time period, then: \begin{equation}
Y_{t-1}=\beta _0+\beta _1X_{t-1}+\beta _2Z_{t-1}+u_{t-1} \tag{11.16} \end{equation}
Subtracting equation 11.16 from equation 11.15 results in:
\begin{equation}
\Delta Y_t=\beta _1\Delta X_t+\beta _2\Delta Z_t+v_t \tag{11.17} \end{equation}

where:

  • [latex]\Delta Y_t=Y_t-Y_{t-1}[/latex]
  • [latex]\Delta X_t=X_t-X_{t-1}[/latex]
  • [latex]\Delta Z_t=Z_t-Z_{t-1}[/latex]
  • [latex]v_t=u_t-u_{t-1}[/latex]

The transformed equation appearing in equation 11.17 is said to be expressed in first-differenced form.

While the levels of [latex]X_t[/latex] and [latex]Z_t[/latex] may be highly correlated, the changes in these variables ([latex]\Delta X_t[/latex] and [latex]\Delta Z_t[/latex]) often exhibit less correlation. Thus, one solution to the multicollinearity problem involves estimating the transformed model appearing in equation 11.17 instead of the original equation appearing in equation 11.15.
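The effect of differencing on the correlation between two trending series can be illustrated with artificial data. In the sketch below, both series are hypothetical: each is a linear trend plus a bounded fluctuation, and the helpers `correlation` and `first_difference` are defined only for this example.

```python
from math import cos, sin, sqrt


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def first_difference(series):
    """Change from the previous period: series[t] - series[t - 1]."""
    return [current - previous for previous, current in zip(series, series[1:])]


# Hypothetical series: a linear trend plus a bounded fluctuation.
periods = range(50)
x = [t + sin(t) for t in periods]
z = [2 * t + cos(1.7 * t) for t in periods]

corr_levels = correlation(x, z)
corr_differences = correlation(first_difference(x), first_difference(z))

print("correlation of levels:", round(corr_levels, 4))
print("correlation of first differences:", round(corr_differences, 4))
```

The levels are almost perfectly correlated because both series share a trend, while the period-to-period changes are only weakly related. With actual economic data the reduction in correlation may be less pronounced.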

The use of a first-differencing procedure may reduce multicollinearity, but may result in other more serious problems. In particular, if all of the assumptions of the classical regression model are satisfied for the original model, the transformed error term, [latex]v_t[/latex], is likely to be correlated across time.[8] As will be shown in Chapter 12, OLS estimates of equation 11.17 will be subject to some potentially serious problems when such a correlation exists. For now, it can be simply noted that this “cure” for multicollinearity may result in problems that are more serious than the initial “disease.”[9]
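The correlation in the transformed error term can be seen directly. As a sketch, suppose that the original error terms satisfy the classical assumptions, so that each [latex]u_t[/latex] has the same variance (denoted here by [latex]\sigma^2[/latex], a symbol introduced only for this illustration) and the error terms are uncorrelated across time. Since [latex]v_t[/latex] and [latex]v_{t-1}[/latex] both contain [latex]u_{t-1}[/latex], with opposite signs:
\begin{equation*}
\text{Cov}(v_t,v_{t-1})=\text{Cov}(u_t-u_{t-1},\,u_{t-1}-u_{t-2})=-\text{Var}(u_{t-1})=-\sigma^2
\end{equation*}
Under the same assumptions, the variance of each transformed error term is:
\begin{equation*}
\text{Var}(v_t)=\text{Var}(u_t)+\text{Var}(u_{t-1})=2\sigma^2
\end{equation*}
so that the correlation between adjacent transformed error terms equals [latex]-\sigma^2/2\sigma^2=-1/2[/latex].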

Polynomial regression models

Consider the following polynomial regression model: \begin{equation}
Y_{i}=\beta _{0}+\beta _{1}X_{i}+\beta _{2}X_{i}^{2}+\cdots +\beta _{k}X_{i}^{k}+u_{i} \tag{11.18}
\end{equation}

One of the problems associated with estimating polynomial regression models of this sort is that the independent variables are often highly correlated in practical applications. As noted by Bradley and Srivastava (1979), this multicollinearity problem is reduced if [latex]X_{i}[/latex] is expressed as a deviation from the sample mean before the higher-order terms are created. In other words, the transformed model is: \begin{equation}
Y_{i}=\gamma _{0}+\gamma _{1}\widetilde{X}_{i}+\gamma _{2}\widetilde{X}_{i}^{2}+\cdots +\gamma _{k}\widetilde{X}_{i}^{k}+u_{i} \tag{11.19}
\end{equation}

where:

[latex]\widetilde{X}_{i}=X_{i}-\overline{X}[/latex]

[latex]\widetilde{X}_{i}^{2}=\left( X_{i}-\overline{X}\right) ^{2}[/latex]

[latex] \vdots[/latex]

[latex]\widetilde{X}_{i}^{k}=\left( X_{i}-\overline{X}\right) ^{k}[/latex]
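The effect of this transformation on the correlation between the linear and squared terms can be checked with a small example. The sketch below uses hypothetical data (the integers 1 through 10) and a hypothetical helper `correlation`. Because these values are symmetric about their mean, the correlation after centering is essentially zero. With skewed data the reduction would generally be smaller.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical regressor values.
x = [float(value) for value in range(1, 11)]
mean_x = sum(x) / len(x)
x_centered = [value - mean_x for value in x]  # deviations from the sample mean

# Correlation between the linear and squared terms, before and after centering.
corr_raw = correlation(x, [value ** 2 for value in x])
corr_centered = correlation(x_centered, [value ** 2 for value in x_centered])

print("correlation of X and X squared:", round(corr_raw, 4))
print("correlation after centering:", round(corr_centered, 4))
```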

It should be noted that when the Bradley and Srivastava transformation is applied, the estimated coefficients of the transformed equation will, in general, differ from the coefficients of the untransformed equation. To see this, consider the use of this transformation in a quadratic model. Suppose the original model is given by:
\begin{equation}
Y_{i}=\beta _{0}+\beta _{1}X_{i}+\beta _{2}X_{i}^{2}+u_{i} \tag{11.20} \end{equation}
and the transformed model is given by:
\begin{equation}
Y_{i}=\gamma _{0}+\gamma _{1}\widetilde{X}_{i}+\gamma _{2}\widetilde{X}_{i}^{2}+u_{i} \tag{11.21}
\end{equation}
This transformed model in equation 11.21 can be restated as: \begin{equation*}
Y_{i}=\gamma _{0}+\gamma _{1}\left( X_{i}-\overline{X}\right) +\gamma _{2}\left( X_{i}-\overline{X}\right) ^{2}+u_{i} \end{equation*}
With a little bit of algebraic manipulation, this becomes: \begin{equation*}
Y_{i}=\gamma _{0}+\gamma _{1}X_{i}-\gamma _{1}\overline{X}+\gamma _{2}X_{i}^{2}-2\gamma_{2}\overline{X}X_{i}+\gamma _{2}\overline{X}^{2}+u_{i} \end{equation*}
or:
\begin{equation}
Y_{i}=(\gamma _{0}-\gamma _{1}\overline{X}+\gamma _{2}\overline{X}^{2})+(\gamma _{1}-2\gamma _{2}\overline{X})X_{i}+\gamma _{2}X_{i}^{2}+u_{i} \tag{11.22}
\end{equation}
A comparison of equation 11.22 with equation 11.20 indicates that the relationship between the [latex]\beta _{j}[/latex] and the [latex]\gamma_{j}[/latex] in a quadratic model is given by:
\begin{equation*}
\beta _{0}=\gamma _{0}-\gamma _{1}\overline{X}+\gamma _{2}\overline{X}^{2} \end{equation*}
\begin{equation*}
\beta _{1}=\gamma _{1}-2\gamma _{2}\overline{X} \end{equation*}
and
\begin{equation*}
\beta _{2}=\gamma _{2}
\end{equation*}
In general, when this procedure is applied to a [latex]k^{\text{th}}[/latex]-order polynomial, only the coefficient on [latex]X^{k}[/latex] will be expected to remain the same.

If the researcher wishes to recover estimates of the original [latex]\beta _{j}[/latex] parameters in this model, this may be achieved by using the relationships: \begin{equation*}
\hat{\beta}_{0}=\hat{\gamma}_{0}-\hat{\gamma}_{1}\overline{X}+\hat{\gamma}_{2}\overline{X}^{2}
\end{equation*}
\begin{equation*}
\hat{\beta}_{1}=\hat{\gamma}_{1}-2\hat{\gamma}_{2}\overline{X} \end{equation*}
and
\begin{equation*}
\hat{\beta}_{2}=\hat{\gamma}_{2}
\end{equation*}
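These relationships can be checked numerically. The sketch below, which assumes NumPy and uses simulated data with arbitrary coefficient values (none of which come from the text), estimates equations 11.20 and 11.21 by OLS and then recovers the [latex]\hat{\beta}_{j}[/latex] from the [latex]\hat{\gamma}_{j}[/latex]. The helper function `ols` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from a quadratic with arbitrary coefficients.
n = 200
x = rng.uniform(10.0, 20.0, size=n)
y = 5.0 + 1.5 * x - 0.04 * x**2 + rng.normal(0.0, 1.0, size=n)

def ols(columns, y):
    """OLS coefficients from regressing y on a constant and the given columns."""
    X = np.column_stack([np.ones_like(y)] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Direct estimates of the original model (equation 11.20).
b0, b1, b2 = ols([x, x**2], y)

# Estimates of the transformed model (equation 11.21).
x_bar = x.mean()
x_tilde = x - x_bar
g0, g1, g2 = ols([x_tilde, x_tilde**2], y)

# Recover the original coefficients from the gamma estimates.
b0_rec = g0 - g1 * x_bar + g2 * x_bar**2
b1_rec = g1 - 2.0 * g2 * x_bar
b2_rec = g2

print("direct:   ", b0, b1, b2)
print("recovered:", b0_rec, b1_rec, b2_rec)
```

The two sets of estimates should agree up to rounding error, since the transformed model is only a reparameterization of the original one.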

A similar procedure may be used to recover the original coefficients when the Bradley and Srivastava transformation is applied to cubic and higher-order polynomial models.
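As an illustration, the same steps can be worked through for a cubic model. Expanding [latex]\left( X_{i}-\overline{X}\right) ^{3}=X_{i}^{3}-3\overline{X}X_{i}^{2}+3\overline{X}^{2}X_{i}-\overline{X}^{3}[/latex] and collecting terms in the same manner as in equation 11.22 suggests the following relationships (a sketch that uses the same notation as above, with [latex]\gamma _{3}[/latex] denoting the coefficient on the centered cubic term):
\begin{equation*}
\beta _{0}=\gamma _{0}-\gamma _{1}\overline{X}+\gamma _{2}\overline{X}^{2}-\gamma _{3}\overline{X}^{3}
\end{equation*}
\begin{equation*}
\beta _{1}=\gamma _{1}-2\gamma _{2}\overline{X}+3\gamma _{3}\overline{X}^{2}
\end{equation*}
\begin{equation*}
\beta _{2}=\gamma _{2}-3\gamma _{3}\overline{X}
\end{equation*}
and
\begin{equation*}
\beta _{3}=\gamma _{3}
\end{equation*}
As in the quadratic case, only the coefficient on the highest-order term is unchanged.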

Goldberger on Multicollinearity

Goldberger (1991) has suggested that econometric textbooks have tended to overemphasize the problem of multicollinearity. As he notes, the problem of multicollinearity is quite analogous to the problem of a small sample size. In both situations, econometricians do not have enough information to generate very precise parameter estimates. Goldberger suggests that the problem may simply be that “multicollinearity” sounds like a more severe problem than “small sample size.”

In a somewhat tongue-in-cheek manner, he suggests that the issue of “small sample size” would receive more detailed treatment if it were renamed “micronumerosity.” As Goldberger puts it:

Tests for the presence of micronumerosity require the judicious use of various fingers. Some researchers prefer a single finger, others use their toes, still others let their thumbs rule.

A generally reliable guide may be obtained by counting the number of observations. Most of the time in econometric analysis, when [latex]n[/latex] is close to zero, it is also far from infinity. (Goldberger, 1991, p. 249)
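The analogy between multicollinearity and a small sample size can be seen in the standard expression for the variance of an OLS slope estimator under the classical assumptions. In the sketch below, [latex]\sigma ^{2}[/latex] (the variance of the error term) and [latex]s_{j}^{2}[/latex] (the sample variance of the [latex]j^{\text{th}}[/latex] independent variable) are notation introduced here for illustration, and [latex]R_{j}^{2}[/latex] is the [latex]R^{2}[/latex] from the auxiliary regression of that variable on the other independent variables:
\begin{equation*}
\text{Var}(\hat{\beta}_{j})=\frac{\sigma ^{2}}{(n-1)\,s_{j}^{2}\left( 1-R_{j}^{2}\right) }
\end{equation*}
A small value of [latex]n[/latex] and a value of [latex]R_{j}^{2}[/latex] close to one enter the denominator in the same way: each reduces the amount of independent variation available for estimating [latex]\beta _{j}[/latex], and each therefore raises the variance of the estimator.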

 

11.3 Is multicollinearity a “problem?”

When multicollinearity is present, the standard errors for some of the estimated coefficients are relatively high. As long as all of the conditions of the classical regression model are satisfied, however, these estimators are still unbiased, consistent, and BLUE. Thus, there are no other linear unbiased estimators that have lower standard errors than the OLS estimators. For this reason, it is generally advisable to estimate the model including all relevant variables. When a multicollinearity problem is present, omitting one or more of the collinear variables will improve [latex]t[/latex]-ratios on the remaining variables. Since this “improvement” comes at the expense of potential bias in the parameter estimates and in the [latex]t[/latex]-ratios themselves, it is probably not worth the cost.

Many beginning econometricians are overly concerned about the possibility of multicollinearity. After finding high correlations between pairs of variables, there is a temptation to exclude one or more variables from the regression. This is a temptation that should generally be avoided.

Students are often surprised to find significant [latex]t[/latex]-ratios and “reasonable” signs and magnitudes even when there is some indication of multicollinearity.
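The trade-off described in this section can be illustrated with a small simulation. The sketch below assumes NumPy and uses an entirely hypothetical data-generating process (a true model with coefficients of 2 and 3 on two regressors whose correlation is about 0.95). It compares the estimated coefficient on the first regressor when the full model is estimated with the estimate obtained when the collinear variable is omitted.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true model: Y = 1 + 2*X1 + 3*X2 + u, with corr(X1, X2) near 0.95.
beta1, beta2, rho = 2.0, 3.0, 0.95
n, reps = 50, 2000

full, short = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

    # Full model: constant, X1, and X2. Keep the coefficient on X1.
    X_full = np.column_stack([np.ones(n), x1, x2])
    full.append(np.linalg.lstsq(X_full, y, rcond=None)[0][1])

    # Misspecified model: X2 omitted. Keep the coefficient on X1.
    X_short = np.column_stack([np.ones(n), x1])
    short.append(np.linalg.lstsq(X_short, y, rcond=None)[0][1])

full, short = np.array(full), np.array(short)
print(f"full model:  mean = {full.mean():.3f}, sd = {full.std():.3f}")
print(f"X2 omitted:  mean = {short.mean():.3f}, sd = {short.std():.3f}")
```

Under these assumptions, the estimates from the full model are centered on the true value of 2 but are widely dispersed. The estimates from the model that omits the second regressor are less dispersed, but they are centered near 2 + 3(0.95) rather than 2, so the apparent gain in precision is purchased with bias.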

11.4 Summary

It is not possible to estimate all intercept and slope parameters in the multiple regression model when perfect multicollinearity is present. In most cases, however, the model can be transformed so that many of the model parameters may be estimated. Typically, perfect multicollinearity is the result of a poorly designed econometric model.

While perfect multicollinearity is a rare phenomenon in well designed models, near-perfect multicollinearity (generally referred to simply as “multicollinearity”) is quite common. The existence of multicollinearity makes it difficult to obtain reliable parameter estimates for some or all of the model parameters. This will often result in equations in which many or all of the independent variables have low [latex]t[/latex]-ratios, even though the joint effect of the affected variables may be highly significant. A number of alternative methods exist for diagnosing the existence of a multicollinearity problem.
While a number of potential remedies for multicollinearity exist, the best solution often involves doing nothing more than reporting the possible presence of multicollinearity. Multicollinearity involves a problem with the data, not a problem with the econometric model. As long as all of the conditions of the classical regression model are satisfied, OLS estimators are BLUE.

11.5 Key Concepts

  • perfect multicollinearity
  • multicollinearity problem
  • auxiliary regressions
  • R[latex]_{j}^{2}[/latex]
  • variance inflation factor (VIF)
  • condition number
  • first-differenced form

11.6 Exercises and problems

  1. Consider the regression model:\begin{equation*}
    Y_i=\beta _0+\beta _1X_{1i}+\beta _2X_{2i}+\beta _3X_{3i}+u_i \end{equation*}
    in which [latex]X_{3i}=X_{1i}+X_{2i}[/latex].

    1. Transform this model into one that makes it possible to estimate some of the parameters using OLS regression techniques.
    2. Which of the parameters in the original equation may be estimated? Explain.
  2. Provide an intuitive explanation of why parameter estimates are unreliable when a multicollinearity problem is present.
  3. An econometrician estimates the following equation: \begin{equation*}\hat{Y}_i=\underset{(2.72)}{24.51}+\underset{(0.21)}{92.34}X_i-\underset{(-0.91)}{12.43}Z_i\end{equation*}\begin{equation*}\text{(}t\text{-ratios in parentheses)}\end{equation*}\begin{equation*}R^2=0.974\end{equation*}
    1. Are any econometric problems apparent? Explain.
    2. What tests could you perform to test for this problem?
  4. Explain why auxiliary regressions help to identify the presence of a multicollinearity problem.
  5. Hause (1972) analyzes the interactions between ability and education using a variety of samples. One of the estimated equations reported in this study is:\begin{equation*}\ln (\text{earnings}_{i})=\beta _{0}-\underset{(0.010)}{0.001}\text{IQ}_{i}+\underset{(0.078)}{0.008}\text{YS}_{i}+\underset{(0.001)}{0.001}\left( \text{IQ}_{i}\times \text{YS}_{i}\right) +\cdots\end{equation*}\begin{equation*}\text{(standard errors in parentheses)}\end{equation*}

where:

  • earnings[latex]_{i}[/latex] = 1965 earnings of person [latex]i[/latex]
  • IQ[latex]_{i}[/latex] = IQ score of person [latex]i[/latex]
  • YS[latex]_{i}[/latex] = years of schooling for person [latex]i[/latex]

These estimates were derived using a sample of 343 white males.

    1. What effect is measured by including the interaction term in this equation?
    2. Which of these coefficients are significant at a 5% significance level?
    3. Hause reports that the coefficient on the interaction term is highly significant when either the IQ[latex]_i[/latex] or YS[latex]_i[/latex] variable is dropped from the regression. What might account for this result?
    4. Is such a situation likely whenever a similar specification is used in a regression equation?
  1. Use the data in the file “ceo.dat” to estimate the parameters of the following equation:
    \begin{equation}
    \text{Totcomp}_{i}=\beta _{0}+\beta _{1}\text{Profits}_{i}+\beta _{2}\text{Sales}_{i}+u_{i} \tag{11.23}
    \end{equation}

where:

  • Totcomp[latex]_{i}[/latex] = Total compensation of CEO in firm [latex]i[/latex]
  • Profits[latex]_{i}[/latex] = Profits of firm [latex]i[/latex]
  • Sales[latex]_{i}[/latex] = Sales of firm [latex]i[/latex]

This is the basic model used by Ciscel and Carroll (1980) to investigate the determinants of CEO salaries.

    1. What does economic theory suggest about the signs of [latex]\beta _{1}[/latex] and [latex]\beta _{2}[/latex]?
    2. Estimate the parameters of equation 11.23 using an OLS estimation procedure. Do the signs agree with expectations? Are the coefficients statistically significant at a .05 significance level? (Be sure to explain whether a one- or two-tailed hypothesis test is used in this case.)
    3. Check for the possible presence of multicollinearity by examining the correlation between Profits[latex]_{i}[/latex] and Sales[latex]_{i}[/latex].
    4. Estimate the parameters of the following two equations: \begin{equation*}\text{Salary}_{i}=\alpha _{0}+\alpha _{1}\text{Profits}_{i}+v_{i} \end{equation*} and: \begin{equation*} \text{Salary}_{i}=\gamma _{0}+\gamma _{1}\text{Sales}_{i}+\epsilon _{i} \end{equation*} Compare these estimates and [latex]t[/latex]-ratios with those appearing in equation 11.23.
  1. Use the data in the file “ceo.dat” to estimate the parameters of the following equation:
    \begin{equation}
    \text{Totcomp}_{i}=\beta _{0}+\beta _{1}\text{Profits}_{i}+\beta _{2}\text{Sales}_{i}+u_{i} \tag{11.24} \end{equation}

where:

  • Totcomp[latex]_{i}[/latex] = Total compensation of CEO in firm [latex]i[/latex]
  • Profits[latex]_{i}[/latex] = Profits of firm [latex]i[/latex]
  • Sales[latex]_{i}[/latex] = Sales of firm [latex]i[/latex]

This is the basic model used by Ciscel and Carroll (1980) to investigate the determinants of CEO salaries.

    1. Do the signs of the estimated coefficients agree with expectations? Are the coefficients statistically significant at a .05 significance level? (Be sure to explain whether a one- or two-tailed hypothesis test is used in this case.)
    2. Estimate the parameters of the following two equations: \begin{equation*}
      \text{Profits}_{i}=a_{0}+a_{1}\text{Sales}_{i}+e_{i} \end{equation*}
      and:
      \begin{equation*}
      \text{Sales}_{i}=b_{0}+b_{1}\text{Profits}_{i}+\eta _{i} \end{equation*}
      and store the estimated residuals, [latex]\hat{e}_{i}[/latex] and [latex]\hat{\eta}_{i}[/latex].
    3. What do [latex]\hat{e}_i[/latex] and [latex]\hat{\eta}_{i}[/latex] measure?
    4. Estimate the parameters of the following two equations: \begin{equation*}\text{Salary}_{i}=\alpha _{0}+\alpha _{1}\text{Profits}_{i}+\alpha _{2}\hat{\eta}_{i}+v_{i}\end{equation*}and:\begin{equation*}\text{Salary}_{i}=\gamma _{0}+\gamma _{1}\hat{e}_{i}+\gamma _{2}\text{Sales}_{i}+\epsilon _{i}\end{equation*}Compare these estimates and [latex]t[/latex]-ratios with those appearing in equation 11.24.
  1. Use the Longley data in the file “longley.dat” to:
    1. estimate the parameters of:
      \begin{equation}
      \text{Self-employed workers}_{t}=\beta _{0}+\beta _{1}\text{Year}_{t}+\beta _{2}\text{GNP deflator}_{t}\tag{11.25} \end{equation}
      \begin{equation*}
      +\beta _{3}\text{GNP}_{t} +\beta _{4}\text{Unemployment}_{t}+\beta _{5}\text{Armed forces}_{t}
      \end{equation*}
      \begin{equation*}+\beta _{6}\text{Population}_{t}+u_{t}\end{equation*}
    2. Estimate six alternative specifications in which a different independent variable is eliminated from equation 11.25. Do the estimated coefficients and [latex]t[/latex]-ratios appear to be very sensitive to the choice of model specification?
    3. Do your results in (a) and (b) provide any suggestions for excluding one or more of the variables in equation 11.25? If so, which variable can be eliminated? If not, explain why the specification should not be changed.
  2. Use the Longley data in the file “longley.dat” to:
    1. estimate a relationship of the form: \begin{equation*}\text{Self-employed workers}_{t}=\beta _{0}+\beta _{1}\text{Year}_{t}+\beta _{2}\text{GNP deflator}_{t}+\beta_{3}\text{GNP}_{t} \end{equation*}\begin{equation*}+\beta _{4}\text{Unemployment}_{t}+\beta _{5}\text{Armed forces}_{t} \end{equation*}\begin{equation*}+\beta _{6}\text{Population}_{t}+u_{t} \end{equation*}
    2. estimate the same relationship using only the last 15 observations. Did the estimates change substantially? If so, why did this occur?
  3. Use the “cons2.dat” file to:
    1. estimate the parameters of equation 11.14.
    2. Estimate an auxiliary regression equation and calculate the variance inflation factor to test for the presence of multicollinearity. Is there evidence of multicollinearity? If so, does it appear to be a serious problem? Why or why not?
  4. Consider the following model used to predict combined SAT scores from other test scores:\begin{equation}\text{SAT}_{i}=\beta _{0}+\beta _{1}\text{HSRANK}_{i}+\beta _{2}\text{READING}_{i}+\beta _{3}\text{VOCAB}_{i} \label{sat.eq.mult} \end{equation}\begin{equation*}+\beta _{4}\text{PICT}_{i}+\beta _{5}\text{LGSC}_{i}+\beta _{6}\text{MATH}_{i}\end{equation*}\begin{equation*}+\beta _{7}\text{MOSAIC}_{i}+u_{i} \end{equation*}

where:

  • SAT[latex]_{i}[/latex] = combined verbal and math SAT scores for person [latex]i[/latex]
  • HSRANK[latex]_{i}[/latex] = high school class rank for person [latex]i[/latex] (expressed as percentile)
  • READING[latex]_{i}[/latex] = scaled reading score for person [latex]i[/latex]
  • VOCAB[latex]_{i}[/latex] = scaled vocabulary score for person [latex]i[/latex]
  • PICT[latex]_{i}[/latex] = scaled picture-number score for person [latex]i[/latex]
  • LGSC[latex]_{i}[/latex] = scaled letter groups score for person [latex]i[/latex]
  • MATH[latex]_{i}[/latex] = scaled math score for person [latex]i[/latex]
  • MOSAIC[latex]_{i}[/latex] = scaled mosaic comparison score for person [latex]i[/latex]
The variables READING, VOCAB, PICT, LGSC, MATH and MOSAIC are standardized test scores on exams that were administered to all participants in the National Longitudinal Study of the High School Class of 1972. (Each of these tests is scaled so that the mean is 50 and the standard deviation is 10.)
    1. What predictions can be made about the signs of each of the slope parameters?
    2. Use the data in the file “sat1.dat” to estimate the parameters of this equation.
    3. Are there any reasons to suspect multicollinearity?
    4. Use auxiliary regressions to test for the presence of multicollinearity. Report the adjusted R[latex]^{2}[/latex] for each auxiliary regression and the variance inflation factor for each variable. What do you conclude?
    5. Perform an [latex]F[/latex] test to examine the joint effect of all insignificant variables (at a 5% significance level).
  1. An econometrician attempts to estimate a consumption function for an economy of the form:\begin{equation*}C_{t}=\beta _{0}+\beta _{1}YD_{t}+\beta _{2}\text{wealth}_{t}+\beta _{3}\text{interest rate}_{t}+u_{t}\end{equation*} When she collects time series data, however, she discovers that the government had kept interest rates constant at 4% for the entire sample period.
    1. What is the interpretation of the coefficient [latex]\beta_{3}[/latex]?
    2. Does this present any problem in estimating the parameters of this consumption function? Explain.
    3. If a problem exists, reformulate the model so that some of the parameters can be estimated.
  2. An econometrician attempts to estimate the following equation using a sample containing only males:\begin{equation*}\ln (\text{earnings}_{i})=\beta _{0}+\beta _{1}\text{experience}_{i}+\beta_{2}\text{experience}_{i}^{2}\end{equation*}\begin{equation*}+\beta _{3}\text{education}_{i}+\beta _{4}\text{male}_{i}+u_{i}\end{equation*} Can all of the parameters of this model be estimated? Explain.
  3. An econometrician specifies a regression model as:\begin{equation*}Y_{i}=\beta _{0}+\beta _{1}X_{i}+\beta _{2}Z_{i}+u_{i}\end{equation*}It is discovered that the variable [latex]X_{i}[/latex] contains a substantial amount of measurement error. If an instrumental variables estimator is used, can the variable [latex]Z_{i}[/latex] be used as the sole instrument in this procedure? Why or why not?
  4. Suppose that an econometrician attempts to explain the behavior of a time series variable using the following cubic equation: \begin{equation*}Y_{t}=\beta _{0}+\beta _{1}\text{time}_{t}+\beta _{2}\text{time}_{t}^{2}+\beta _{3}\text{time}_{t}^{3}+u_{t}\end{equation*} where time[latex]_{t}[/latex] is a time trend variable that takes on the values [latex]1, 2, \ldots , 30[/latex] for the 30 years of available data.
    1. Use a spreadsheet or regression software package to generate the first 30 observations for the variables: time[latex]_t[/latex], time[latex]_t^2[/latex], and time[latex]_t^3[/latex].
    2. Examine the possibility of multicollinearity existing among these variables by:
      1. examining the pairwise correlations that exist between each pair of these independent variables,
      2. estimating auxiliary regressions and examining the R[latex]^2[/latex] for these independent variables.
    3. What do the results from (a) suggest about the possibility of multicollinearity existing among these variables?
    4. Repeat (a) using 100 observations. Does the use of more data reduce the extent of the multicollinearity problem in this case?
  5. Suppose that an econometrician attempts to explain the behavior of a time series variable using the following cubic equation: \begin{equation*}Y_t=\beta _0+\beta _1\text{time}_t+\beta _2\text{time}_t^2+\beta _3\text{time}_t^3+u_t\end{equation*} where time[latex]_t[/latex] is a time trend variable that takes on the values [latex]1, 2,\ldots , 30[/latex] for the 30 years of available data.
    1. Use a spreadsheet or regression software package to generate the first 30 observations for the variables: time[latex]_t[/latex], time[latex]_t^2[/latex], and time[latex]_t^3[/latex].
    2. Examine the possibility of multicollinearity existing among these variables by:
      1. examining the pairwise correlations that exist between each pair of these independent variables,
      2. estimating auxiliary regressions and examining the R[latex]^2[/latex] for these independent variables.
    3. What do the results from (a) suggest about the possibility of multicollinearity existing among these variables?
    4. Repeat (a) after transforming the time value so that it is expressed as deviations from the sample mean. In other words, use the following time variables:\begin{equation*}\widetilde{\text{time}_t}=\text{time}_t-\overline{\text{time}}\end{equation*}\begin{equation*}\widetilde{\text{time}_t}^2=\left( \widetilde{\text{time}_t}\right) ^2\end{equation*}\begin{equation*}\widetilde{\text{time}_t}^3=\left( \widetilde{\text{time}_t}\right) ^3\end{equation*} in place of the original variables. Does the use of this transformation reduce the extent of the multicollinearity problem in this case?
  6. Use the data contained in the file “nls72.dat” to estimate the parameters of the following equation:\begin{equation}\ln (\text{earnings}_i)=\beta _0+\beta _1\text{experience}_i+\beta _2\text{experience}_i^2+\beta _3\text{experience}_i^3  \tag{11.27}\end{equation} \begin{equation*}+\beta _4\text{male}_i+\beta _5\text{SC}_i+\beta _6\text{CD}_i+\beta _7\text{MA}_i+\beta _8\text{PhD}_i\end{equation*} \begin{equation*}+\beta _9\text{Black}_i+\beta _{10}\text{Hisp}_i+u_i\end{equation*}

where:

  • earnings[latex]_i[/latex] = annual earnings for individual [latex]i[/latex] in 1985
  • experience[latex]_i[/latex] = months of work experience at two most recent jobs
  • male[latex]_i[/latex] = 1 if male, = 0 if female
  • SC[latex]_i[/latex] = 1 if the highest level of educational attainment is 1-3 years of college
  • CD[latex]_i[/latex] = 1 if the highest level of educational attainment is a bachelor’s degree
  • MA[latex]_i[/latex] = 1 if the highest level of educational attainment is a master’s degree
  • PhD[latex]_i[/latex] = 1 if the highest level of educational attainment is a Ph.D. (or equivalent)
  • Black[latex]_i[/latex] = 1 if individual [latex]i[/latex] is African-American
  • Hisp[latex]_i[/latex] = 1 if individual [latex]i[/latex] is Hispanic
    1. Estimate the parameters of equation 11.27 using an OLS estimation procedure.
    2. Investigate the possibility of multicollinearity among the experience variables by examining the pairwise correlations existing among the experience[latex]_i[/latex], experience[latex]_i^2[/latex], and experience[latex]_i^3[/latex] terms.
    3. Investigate the possibility of multicollinearity by estimating three auxiliary equations in which the dependent variables are the three experience variables: experience[latex]_i[/latex], experience[latex]_i^2[/latex], and experience[latex]_i^3[/latex].
    4. What do the results in (b) and (c) suggest about the presence of multicollinearity among these variables?

  1. As noted in Chapter 9, the dummy variables corresponding to all categories of a qualitative variable such as gender may be included in the regression if the constant term is omitted from the regression equation. The standard practice, however, is to include a constant term in the regression unless there is a compelling theoretical reason to omit the constant term.
  2. Some regression packages will abort the estimation process and report the presence of perfect multicollinearity. Others will drop one or more of the affected variables and estimate the parameters of the simplified model.
  3. A full discussion of this concept requires the use of matrix tools that are beyond the scope of this text. For those readers possessing a background in matrix algebra, the condition number is defined as the square root of the ratio of the largest to the smallest eigenvalue of the X'X matrix (where each column of X consists of the observations on a given independent variable). Readers with a solid background in matrix algebra may wish to refer to Belsley, Kuh, and Welsch (1980).
  4. SAS and several other packages report a series of condition indexes when multicollinearity diagnostics are requested. The condition number is equal to the largest of the condition index values.
  5. The file “longley.dat” contains this data.
  6. Another possible solution involves the use of a method known as “ridge regression.” This method makes it possible to form estimators that have a lower mean-squared error than the OLS estimators. Ridge regression estimators, however, are biased. A discussion of this technique requires mathematical methods that are beyond the scope of this text. The interested reader may find a discussion of ridge regression methods in more advanced econometrics texts. See, for example, the discussion in Greene (2000), p. 258; Johnston (1984), p. 252; or Chow (1983), pp. 97-98.
  7. One of the end-of-chapter exercises asks the reader to determine whether a multicollinearity problem is likely to exist when equation 11.14 is estimated using annual U.S. time-series data.
  8. To see this, note that:\begin{equation*}v_t=u_t-u_{t-1}\end{equation*} and: \begin{equation*} v_{t-1}=u_{t-1}-u_{t-2} \end{equation*} Since both [latex]v_t[/latex] and [latex]v_{t-1}[/latex] include the effect of [latex]u_{t-1}[/latex], [latex]v_t[/latex] and [latex]v_{t-1}[/latex] will, in general, be correlated.
  9. This procedure also results in the loss of some sample information, since one observation is lost when the differencing operation is applied. Suppose, for example, that the original data set consists of annual time-series data for the period 1956-1996. The differenced variables can be computed only for the 40 years 1957-1996 (since the differenced observation for 1956 cannot be computed without knowledge of the values of all of the variables in 1955).
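
The definition of the condition number given in note 3 can be sketched numerically. The example below assumes NumPy and uses simulated regressors that are purely hypothetical; the function name `condition_number` is introduced here for illustration. It applies the definition to the data as given. In practice, the columns of X are often rescaled before the eigenvalues are computed, a step that is omitted from this sketch.

```python
import numpy as np

def condition_number(X):
    """Square root of the ratio of the largest to the smallest eigenvalue of X'X."""
    eigenvalues = np.linalg.eigvalsh(X.T @ X)  # X'X is symmetric
    return float(np.sqrt(eigenvalues.max() / eigenvalues.min()))

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)             # unrelated to x1
x2_collinear = x1 + 0.01 * rng.normal(size=n)   # nearly a copy of x1

low = condition_number(np.column_stack([x1, x2_independent]))
high = condition_number(np.column_stack([x1, x2_collinear]))

print(f"unrelated regressors:        {low:.1f}")
print(f"nearly collinear regressors: {high:.1f}")
```

The condition number is close to one when the regressors are unrelated and becomes very large when one regressor is nearly a linear function of the other.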

License

Econometrics Copyright © by John Kane is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.