FAQ/RegressionOutliers


Checking for outliers in regression

According to Hoaglin and Welsch (1978), leverage values above 2(p+1)/n, where p predictors are in the regression on n observations (items), indicate influential values. If the sample size is below 30, a stiffer criterion such as 3(p+1)/n is suggested.

Leverage is also related to the i-th observation's Mahalanobis distance, MD(i). For sample size N,

Leverage for observation i = MD(i)/(N-1) + 1/N

so

Critical MD(i) = (2(p+1)/N - 1/N)(N-1)

(see Tabachnick and Fidell). Here MD(i) is the squared Mahalanobis distance of observation i from the centroid of the predictors.

Other outlier detection methods are available: boxplots are covered in the Exploratory Data Analysis Graduate talk, and z-scores can be tested using, for example, Grubbs' test, for which further details and an on-line calculator are available.

Hair, Anderson, Tatham and Black (1998) suggest that observations with Cook's distances greater than 1 are influential. Hair et al. mention that some people also use 4/(N-k-1), for k predictors and N points, as a threshold for Cook's distance, which usually gives a lower threshold than 1 (e.g. with 1 predictor and 27 observations this gives 4/(27-1-1) = 0.16). A third threshold of 4/N is also mentioned (Bollen and Jackman, 1990), which would give a threshold of 4/27 = 0.15 in the above example.

References

Bollen, K. A. and Jackman, R. W. (1990). Regression diagnostics: An expository treatment of outliers and influential cases. In Fox, J. and Long, J. S. (eds.), Modern Methods of Data Analysis (pp. 257-91). Newbury Park, CA: Sage.

Hair, J., Anderson, R., Tatham, R. and Black, W. (1998). Multivariate Data Analysis (fifth edition). Englewood Cliffs, NJ: Prentice-Hall.

Hoaglin, D. C. and Welsch, R. E. (1978). The hat matrix in regression and ANOVA. The American Statistician 32, 17-22.

These pages are maintained by Ian Nimmo-Smith and Peter Watson
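Worked example of the thresholds on this page

The cut-offs discussed on this page can be computed directly. The following is a minimal sketch in Python using NumPy, not part of the original FAQ. The function names regression_diagnostics and thresholds, and the dictionary keys, are hypothetical and chosen only for illustration. It assumes a full-rank design matrix with an intercept, and that the Mahalanobis distance is the squared distance computed with the sample covariance (divisor N-1), which is what makes the leverage relation hold exactly.

```python
import numpy as np


def regression_diagnostics(X, y):
    """Return (leverage, squared Mahalanobis distance, Cook's distance) per observation.

    X holds the p predictors (one column each, no intercept column); y is the response.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Design matrix with an intercept column.
    design = np.column_stack([np.ones(n), X])

    # Leverage = diagonal of the hat matrix, obtained from a reduced QR decomposition.
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q ** 2, axis=1)

    # Squared Mahalanobis distance of each row from the predictor centroid.
    centred = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))  # sample covariance, divisor n-1
    md2 = np.einsum("ij,jk,ik->i", centred, np.linalg.inv(cov), centred)

    # Ordinary least squares residuals and residual mean square.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    mse = resid @ resid / (n - p - 1)

    # Cook's distance for each observation.
    cooks = resid ** 2 * leverage / ((p + 1) * mse * (1 - leverage) ** 2)
    return leverage, md2, cooks


def thresholds(n, p):
    """Cut-offs quoted on this page for n observations and p predictors."""
    # Stiffer leverage criterion for samples smaller than 30.
    leverage_cut = (3 if n < 30 else 2) * (p + 1) / n
    return {
        "leverage": leverage_cut,
        "mahalanobis": (leverage_cut - 1 / n) * (n - 1),
        "cook_one": 1.0,
        "cook_4_over_n_minus_k_minus_1": 4 / (n - p - 1),
        "cook_4_over_n": 4 / n,
    }


# Usage with made-up data: 27 observations, 1 predictor.
rng = np.random.default_rng(0)
x = rng.normal(size=27)
y = 2.0 + 0.5 * x + rng.normal(size=27)
lev, md2, cooks = regression_diagnostics(x, y)
cuts = thresholds(n=27, p=1)
flagged = np.flatnonzero(cooks > cuts["cook_4_over_n"])
print(cuts)
print("Observations above the 4/N Cook's distance threshold:", flagged)
```

Note that thresholds applies the stiffer 3(p+1)/n leverage criterion when n is below 30, so for small samples the critical Mahalanobis distance it returns is based on that criterion rather than on 2(p+1)/n. The data in the usage lines are simulated purely to exercise the functions.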