What if independent variables are correlated?

So, if you are only interested in prediction, multicollinearity is not a problem. There are two popular ways to measure multicollinearity: (1) compute a coefficient of multiple determination for each independent variable, or (2) compute a variance inflation factor for each independent variable. In the previous lesson, we described how the coefficient of multiple determination R² measures the proportion of variance in the dependent variable that is explained by all of the independent variables.

If we ignore the dependent variable, we can compute a coefficient of multiple determination R²_k for each of the k independent variables. We do this by regressing the kth independent variable on all of the other independent variables.

That is, we treat X_k as the dependent variable and use the other independent variables to predict X_k. How do we interpret R²_k? If R²_k equals zero, variable k is not correlated with any other independent variable, and multicollinearity is not a problem for variable k.

As a rule of thumb, most analysts feel that multicollinearity is a potential problem when R²_k is greater than 0.75. The variance inflation factor is another way to express exactly the same information found in the coefficient of multiple determination.

A variance inflation factor is computed for each independent variable, using the following formula: VIF_k = 1 / (1 - R²_k). Many statistical packages (Minitab, for example) can display the variance inflation factor as part of the regression coefficient table. The interpretation of the variance inflation factor mirrors the interpretation of the coefficient of multiple determination.

As a rule of thumb, multicollinearity is a potential problem when VIF_k is greater than 4, and a serious problem when it is greater than 10.
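As a rough illustration of the computation described above, the Python sketch below regresses each independent variable on the remaining ones, converts the resulting R²_k into VIF_k = 1 / (1 - R²_k), and flags values above the rule-of-thumb threshold of 4. It is not part of the original lesson: the data are made up, and only NumPy is assumed.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_k for each column of X (an n-by-k array of predictors).

    For each column j, regress X[:, j] on the remaining columns (plus an
    intercept), compute R^2 for that regression, and set
    VIF_j = 1 / (1 - R^2_j).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]                                # treat X_j as the response
        others = np.delete(X, j, axis=1)           # all remaining predictors
        A = np.column_stack([np.ones(n), others])  # design matrix with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

# Hypothetical data: three predictors, the third strongly related to the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

for j, v in enumerate(variance_inflation_factors(X), start=1):
    flag = "potential problem" if v > 4 else "ok"
    print(f"VIF_{j} = {v:.2f} ({flag})")
```

In this made-up example, the first and third predictors are nearly collinear, so their VIFs come out far above 4, while the unrelated second predictor stays near 1.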

Bottom line: If R²_k is greater than 0.75 (or, equivalently, VIF_k is greater than 4), the regression coefficients for the affected variables may be poorly estimated, and significance tests on those coefficients may be misleading. If you only want to predict the value of a dependent variable, you may not have to worry about multicollinearity.

Multiple regression can produce a regression equation that will work for you, even when independent variables are highly correlated. The problem arises when you want to assess the relative importance of an independent variable with a high R²_k (or, equivalently, a high VIF_k).

In this situation, common remedies include removing one of the correlated variables, combining the correlated variables into a single predictor, or collecting additional data. Note: Multicollinearity only affects variables that are highly correlated. If the variable you are interested in has a small R²_k, statistical analysis of its regression coefficient will be reliable and informative.

That analysis will be valid, even when other variables exhibit high multicollinearity. In this section, two problems illustrate the role of multicollinearity in regression analysis. In Problem 1, we see what happens when multicollinearity is small; in Problem 2, we see what happens when it is large. In the previous lesson, we used data from the table to develop a least-squares regression equation to predict test score.

We also conducted statistical tests to assess the contribution of each independent variable. The two approaches are equivalent, so in practice you only need to do one or the other, not both. In the previous lesson, we showed how to compute a coefficient of multiple determination with Excel, and how to derive a variance inflation factor from the coefficient of multiple determination.

Here are the variance inflation factors and the coefficients of multiple determination for the present problem. We have rules of thumb to interpret VIF_k and R²_k: multicollinearity makes it hard to interpret the statistical significance of the regression coefficient for variable k when VIF_k is greater than 4 or when R²_k is greater than 0.75.

Since neither condition is evident in this problem, we can safely accept the results of statistical tests on regression coefficients. We actually conducted those tests for this problem in the previous lesson. For convenience, the key result is repeated here: the p-values for IQ and for Study Hours are statistically significant at the 0.05 level. The terms "independent" and "dependent" variable are less subject to causal interpretation, as they do not strongly imply cause and effect.

In correlation analysis, we estimate a sample correlation coefficient, more specifically the Pearson product-moment correlation coefficient. The sample correlation coefficient, denoted r, ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables. The correlation between two variables can be positive (i.e., higher levels of one variable are associated with higher levels of the other) or negative (i.e., higher levels of one variable are associated with lower levels of the other).

The sign of the correlation coefficient indicates the direction of the association. The magnitude of the correlation coefficient indicates the strength of the association. A correlation close to zero suggests no linear association between two continuous variables. It is important to note that there may be a non-linear association between two continuous variables, but computation of a correlation coefficient does not detect this.

Therefore, it is always important to evaluate the data carefully before computing a correlation coefficient. Graphical displays are particularly useful to explore associations between variables. The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.

A small study is conducted involving 17 infants to investigate the association between gestational age at birth, measured in weeks, and birth weight, measured in grams.

We wish to estimate the association between gestational age and infant birth weight. In this example, birth weight is the dependent variable and gestational age is the independent variable. The data are displayed in a scatter diagram in the figure below. Each point represents an (x, y) pair, in this case the gestational age, measured in weeks, and the birth weight, measured in grams. Note that the independent variable, gestational age, is on the horizontal axis (or X-axis), and the dependent variable, birth weight, is on the vertical axis (or Y-axis).

The scatter plot shows a positive or direct association between gestational age and birth weight. Infants with shorter gestational ages are more likely to be born with lower weights and infants with longer gestational ages are more likely to be born with higher weights.
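A scatter diagram like this is straightforward to produce with matplotlib. The sketch below is illustrative only; the gestational age and birth weight values are invented stand-ins, not the data from the 17-infant study.

```python
import matplotlib.pyplot as plt

# Hypothetical (x, y) pairs: gestational age in weeks, birth weight in grams.
# These are illustrative values, not the data from the 17-infant study.
gestational_age = [34.7, 36.0, 29.3, 40.1, 35.7, 38.3, 37.3, 41.4]
birth_weight = [1895, 2030, 1440, 3835, 3090, 3115, 2950, 3430]

plt.scatter(gestational_age, birth_weight)
plt.xlabel("Gestational age (weeks)")   # independent variable on the X-axis
plt.ylabel("Birth weight (grams)")      # dependent variable on the Y-axis
plt.title("Birth weight versus gestational age (hypothetical data)")
plt.show()
```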

The variances of x and y measure the variability of the x scores and y scores around their respective sample means, considered separately. The covariance measures the variability of the (x, y) pairs around the mean of x and the mean of y, considered simultaneously. To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and also the covariance of gestational age and birth weight.

To compute the variance of gestational age, we need to sum the squared deviations (or differences) between each observed gestational age and the mean gestational age, and divide by n - 1. The computations are summarized below. The variance of birth weight is computed just as we did for gestational age, as shown in the table below. To compute the covariance of gestational age and birth weight, we multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant, sum those products, and divide by n - 1: Cov(x, y) = Σ(x - x̄)(y - ȳ) / (n - 1).

Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply. In practice, what counts as a meaningful correlation (i.e., one that is clinically or practically important) depends on the context. There are also statistical tests to determine whether an observed correlation is statistically significant (i.e., significantly different from zero). Procedures to test whether an observed sample correlation is suggestive of a statistically significant correlation are described in detail in Kleinbaum, Kupper and Muller.
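The arithmetic described above (sample variances, sample covariance, and then the correlation r = Cov(x, y) / (s_x · s_y)) can be sketched in a few lines of Python. The numbers below are again invented stand-ins for the study data, used only to show the calculation.

```python
import numpy as np

# Hypothetical gestational age (weeks) and birth weight (grams) pairs.
x = np.array([34.7, 36.0, 29.3, 40.1, 35.7, 38.3, 37.3, 41.4])
y = np.array([1895, 2030, 1440, 3835, 3090, 3115, 2950, 3430], dtype=float)

n = len(x)
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)                # sample variance of x
var_y = np.sum((y - y.mean()) ** 2) / (n - 1)                # sample variance of y
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # sample covariance

r = cov_xy / np.sqrt(var_x * var_y)                          # sample correlation coefficient
print(f"r = {r:.3f}")

# Cross-check against NumPy's built-in calculation.
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```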

Regression analysis is a widely used technique that is useful for many applications. We introduce the technique here and expand on its uses in subsequent modules. Simple linear regression is a technique that is appropriate for understanding the association between one independent (or predictor) variable and one continuous dependent (or outcome) variable.

In regression analysis, the dependent variable is denoted Y and the independent variable is denoted X. When there is a single continuous dependent variable and a single independent variable, the analysis is called a simple linear regression analysis.

This analysis assumes that there is a linear association between the two variables. If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed. The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol.

Each point represents an observed (x, y) pair, in this case the BMI and the corresponding total cholesterol measured in each participant. Note that the independent variable (BMI) is on the horizontal axis and the dependent variable (total serum cholesterol) is on the vertical axis.

The graph shows that there is a positive or direct association between BMI and total cholesterol; participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels.

For either of these relationships, we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is Ŷ = b0 + b1X, where Ŷ is the predicted value of the outcome, b0 is the estimated Y-intercept, and b1 is the estimated slope.

The Y-intercept and slope are estimated from the sample data; they are the values that minimize the sum of the squared differences between the observed and predicted values of the outcome. These differences between observed and predicted values are called residuals, so the estimates of the Y-intercept and slope are the values that minimize the sum of the squared residuals, and they are called the least squares estimates.

Conceptually, if the values of X provided a perfect prediction of Y then the sum of the squared differences between observed and predicted values of Y would be 0. That would mean that variability in Y could be completely explained by differences in X.

However, if the differences between observed and predicted values are not 0, then we are unable to account entirely for differences in Y based on X, and there are residual errors in the prediction. The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y.

Based on the observed data, the best estimate of a linear relationship will be obtained from an equation for the line that minimizes the differences between observed and predicted values of the outcome. The Y-intercept of this line is the value of the dependent variable (Y) when the independent variable (X) is zero. The slope of the line is the change in the dependent variable (Y) for a one-unit change in the independent variable (X). The least squares estimates of the slope and Y-intercept are computed as follows: b1 = Cov(x, y) / Var(x), and b0 = ȳ - b1·x̄.

Because a BMI of zero is meaningless, the Y-intercept is not informative here. The slope, however, has a direct interpretation: if we compare two participants whose BMIs differ by 1 unit, we would expect their total cholesterols to differ by approximately 6. More generally, given a participant's BMI, we would substitute it into the estimated regression equation to predict their total cholesterol.
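To make the least squares formulas above concrete, here is a small Python sketch. The BMI and cholesterol values are invented for illustration, not taken from the study; the point is only to show b1 = Cov(x, y) / Var(x) and b0 = ȳ - b1·x̄ in action, along with a prediction for a new value of X.

```python
import numpy as np

# Hypothetical (BMI, total cholesterol) pairs for illustration only.
bmi = np.array([21.5, 24.0, 26.3, 28.7, 31.2, 33.8])
chol = np.array([168.0, 180.0, 194.0, 205.0, 223.0, 238.0])

n = len(bmi)
cov_xy = np.sum((bmi - bmi.mean()) * (chol - chol.mean())) / (n - 1)
var_x = np.sum((bmi - bmi.mean()) ** 2) / (n - 1)

b1 = cov_xy / var_x                 # least squares slope
b0 = chol.mean() - b1 * bmi.mean()  # least squares Y-intercept
print(f"Estimated equation: Y-hat = {b0:.1f} + {b1:.2f} * X")

# Predict total cholesterol for a new (hypothetical) participant.
new_bmi = 27.0
print(f"Predicted total cholesterol at BMI {new_bmi}: {b0 + b1 * new_bmi:.1f}")

# Cross-check against NumPy's straight-line least squares fit.
slope, intercept = np.polyfit(bmi, chol, 1)
assert np.isclose(b1, slope) and np.isclose(b0, intercept)
```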


