# Errors in variables when doing a regression

If there is measurement error in a predictor (x), the estimated slope and intercept will not converge to their true values: they are biased. In particular the estimated slope converges to slope x Variance(x(true)) / (Variance(x(true)) + Variance(Measurement Error of x)). Since the attenuating factor is less than one whenever the measurement error variance is non-zero, the slope is underestimated in the presence of measurement error.
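A small simulation makes the attenuation visible. This is an illustrative sketch (the sample size, seed, slope and variances are invented for the example): with reliability Variance(x(true))/Variance(x(obs)) = 4/5 = 0.8, the fitted slope should shrink from 2.0 toward 2.0 x 0.8 = 1.6.

```python
import random

random.seed(1)
n = 100_000
true_slope = 2.0
var_x, var_err = 4.0, 1.0   # reliability = 4 / (4 + 1) = 0.8

x_true = [random.gauss(0, var_x ** 0.5) for _ in range(n)]
x_obs = [xt + random.gauss(0, var_err ** 0.5) for xt in x_true]
y = [true_slope * xt + random.gauss(0, 1) for xt in x_true]

def ols_slope(x, y):
    """Ordinary least squares slope: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

print(ols_slope(x_true, y))  # close to the true slope, 2.0
print(ols_slope(x_obs, y))   # attenuated, close to 2.0 * 0.8 = 1.6
```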

Klauer KC, Draine SC and Greenwald AG (1998) An unbiased errors-in-variables approach to detecting unconscious cognition. *British Journal of Mathematical and Statistical Psychology* **51** 253-267 present a method for estimating the slope and intercept and their standard errors adjusting for measurement error. The estimates can be obtained using a FORTRAN program written by the authors. Their approach estimates the slope and intercept using a truncated Normal distribution, which assumes no negative x values. The authors claim in their discussion that this assumption is robust to differing distributions of x (exponential and uniform) which allow negative values.

R has a Bayesian procedure, *leiv*, which uses a Cauchy prior for the slope combined with a likelihood based on the sample standard deviations of the predictor x and the outcome y and their correlation to produce posterior distributions for the slope and intercept adjusted for measurement error. From these posterior distributions it can also report median values and credible intervals for the slope and intercept. The Bayesian procedure is described in Leonard D (2011) Estimating a bivariate linear relationship. *Bayesian Analysis* **6(4)** 727-754.

## Special case (simple regression with a single predictor)

Goldstein (2015) gives formulae for correcting the slope and intercept in a simple regression of an outcome, y, on a single predictor, x. In particular, suppose we know the reliability of x, R = variance(x(true))/variance(x(obs)), where x(obs) = x(true) + measurement error. Then

if y = a* + b*x(obs) + e* for observed x, and

y = a + bx(true) + e for the true value of x, then for the intercept a and slope b corresponding to the true value of x

b = b*/R and a = ybar - b xbar.

If R is taken to be the correlation, r, between x and y then, since the uncorrected slope is b* = r sd(y)/sd(x), we have b = b*/R = b*/(b* sd(x)/sd(y)) = sd(y)/sd(x).

This formula is used by the *leiv* routine mentioned above, taking the correlation between x and y as the measure of R, the reliability of x. For example, if x has mean 7.35 (sd = 5.53), y has mean 7.22 (sd = 4.70), correlation(x,y) = 0.70 and the slope b* for x(obs) is 0.60, then

b = b*/R = 0.60/0.70 = 0.86, which agrees (up to the rounding of b* to 0.60) with sd(y)/sd(x) = 4.70/5.53 = 0.85.

a = ybar - b xbar = 7.22 - 0.85 × 7.35 = 0.97.
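The arithmetic above can be checked with a few lines of code (an illustrative sketch; variable names are invented, the numbers are those of the worked example):

```python
# Correct the slope and intercept for measurement error using the
# correlation between x and y as the reliability R.
xbar, sd_x = 7.35, 5.53
ybar, sd_y = 7.22, 4.70
r = 0.70                   # correlation(x, y), used as R
b_obs = 0.60               # uncorrected slope b*

b = b_obs / r              # corrected slope, about 0.86
b_check = sd_y / sd_x      # sd(y)/sd(x), about 0.85 (agrees up to rounding of b*)
a = ybar - b_check * xbar  # corrected intercept, about 0.97

print(round(b, 2), round(b_check, 2), round(a, 2))
```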

Goldstein also suggests using a range of reliabilities, for example R = 1, R = 0.75 and R = 0.65, to assess the sensitivity of the regression coefficients to measurement error. He also recommends and illustrates adjustment for measurement error in a multiple regression using the Bayesian approach of Richardson and Gilks (1993), implemented in the WinBUGS freeware, which may be run from R. An example of WinBUGS syntax for measurement error in a simple regression (one predictor, x) is given here. To run this syntax from R, the WinBUGS14 software needs to be downloaded and placed in the Program Files directory on the C: drive of your PC.

## Standard errors for the slope and intercept in a simple regression

Given the formulae above for the slope and intercept corrected for measurement error when there is just one predictor, we can obtain standard errors for them using the delta method (see here).

For a sample of size n from a normal distribution, the variance of the sample variance is approximately 2 sd^4/(n-1), so the variance of the sample standard deviation is approximately sd^2/(2(n-1)) (see here).

Variance of slope = V( sd(y)/sd(x) ) = V(sd(y))/sd(x)^2 + sd(y)^2 V(sd(x))/sd(x)^4 = sd(y)^2/(2(n-1) sd(x)^2) + sd(y)^2/(2(n-1) sd(x)^2) = sd(y)^2/((n-1) sd(x)^2) (this assumes the standard deviations of x and y are independent).
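A quick Monte Carlo check of the delta-method approximation Var(sd(y)/sd(x)) ≈ sd(y)^2/((n-1) sd(x)^2) for independent normal samples. This is an illustrative sketch; the sample size, number of replicates and seed are invented, and the standard deviations are taken from the worked example above. For moderate n the two variances should agree to within roughly ten per cent.

```python
import random
import statistics

random.seed(2)
n, reps = 50, 4000
sigma_x, sigma_y = 5.53, 4.70   # sds from the worked example

# Repeatedly draw independent x and y samples and record sd(y)/sd(x).
ratios = []
for _ in range(reps):
    xs = [random.gauss(0, sigma_x) for _ in range(n)]
    ys = [random.gauss(0, sigma_y) for _ in range(n)]
    ratios.append(statistics.stdev(ys) / statistics.stdev(xs))

empirical = statistics.variance(ratios)
delta = sigma_y ** 2 / ((n - 1) * sigma_x ** 2)   # delta-method variance
print(empirical, delta)   # the two should be of similar size
```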

If we ignore the uncertainty in the correlation then, since the corrected slope is the uncorrected slope divided by the correlation, variance(corrected slope) = variance(uncorrected slope)/correlation^2 (see page 218 in the pdf chapter here).

Variance of intercept = V(ybar - slope × xbar) = V(ybar) + slope^2 V(xbar) + xbar^2 V(slope), which assumes the estimated slope is independent of xbar, i.e. that the slope of y on x is not related to the particular values of x.
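Plugging the numbers of the worked example into these formulae gives illustrative standard errors. This is a sketch only: the sample size n = 50 is an invented assumption (it is not given in the text), V(xbar) and V(ybar) are taken as sd^2/n, and V(slope) as sd(y)^2/((n-1) sd(x)^2) per the delta-method approximation for independent standard deviations.

```python
n = 50                      # assumed sample size, not from the text
xbar, sd_x = 7.35, 5.53
ybar, sd_y = 7.22, 4.70

slope = sd_y / sd_x                               # corrected slope, about 0.85
var_slope = sd_y ** 2 / ((n - 1) * sd_x ** 2)     # delta-method variance of slope
var_xbar = sd_x ** 2 / n                          # variance of the sample mean of x
var_ybar = sd_y ** 2 / n                          # variance of the sample mean of y
var_intercept = var_ybar + slope ** 2 * var_xbar + xbar ** 2 * var_slope

# SE(slope) about 0.12, SE(intercept) about 1.30 under these assumptions
print(round(var_slope ** 0.5, 2), round(var_intercept ** 0.5, 2))
```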

We could also construct bootstrap samples to obtain confidence intervals for the slope and intercept.
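A bootstrap percentile interval for the corrected slope can be sketched as follows. The data here are synthetic (the sample, seed, residual sd and resample counts are invented for illustration), and the corrected slope is computed as sd(y)/sd(x) as above.

```python
import random
import statistics

random.seed(3)
n = 100
# Synthetic (x, y) sample, loosely matched to the worked example.
x = [random.gauss(7.35, 5.53) for _ in range(n)]
y = [0.85 * (xi - 7.35) + 7.22 + random.gauss(0, 2.5) for xi in x]

def corrected_slope(x, y):
    """Slope corrected for measurement error: sd(y)/sd(x)."""
    return statistics.stdev(y) / statistics.stdev(x)

# Resample pairs with replacement and recompute the corrected slope.
boots = []
for _ in range(2000):
    idx = [random.randrange(n) for _ in range(n)]
    boots.append(corrected_slope([x[i] for i in idx], [y[i] for i in idx]))
boots.sort()
lo, hi = boots[49], boots[1949]   # approximate 2.5% and 97.5% percentiles
print(round(lo, 2), round(hi, 2))
```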

References

Goldstein H (2015) Jumping to the wrong conclusions. *Significance* **12(5)** 18-21. See this article here.

Lunn D, Jackson C, Best N, Thomas A and Spiegelhalter D (2012) *The BUGS Book: A Practical Introduction to Bayesian Analysis.* Chapman and Hall/CRC Press: London.

Richardson S and Gilks W (1993) Conditional independence models for epidemiological studies with covariate measurement error. *Statistics in Medicine* **12** 1703-1722.