Diff for "FAQ/studentres" - CBU statistics Wiki
Differences between revisions 11 and 12
Revision 11 as of 2013-03-08 10:17:31
Size: 1966
Editor: localhost
Comment: converted to 1.6 markup
Revision 12 as of 2015-01-28 16:29:08
Size: 1768
Editor: PeterWatson
Comment:

How do I check for outliers in a simple regression with one predictor variable?

A simple way to check for outliers is to evaluate either standardized or studentized residuals and see whether many have large absolute values, e.g. greater than 2. The key reason for studentizing is that the variance of a residual depends on the predictor value at which it is observed, so residuals at different predictor values have different variances.

This can be done as follows:

  1. Standardize both the response variable and the predictor variable by subtracting their means and dividing by their standard deviations; call these Y(s) and x(s).
  2. Evaluate a Pearson or Spearman correlation, R, between the two variables.
  3. Obtain the i-th raw residual as Y(si) - R x(si).
  4. To obtain the standardized residual just divide by the standard deviation of the residuals. The mean raw residual should be zero.
  5. The studentized residual may also be used to identify potential outliers. This divides the raw residual by its standard error, SE_RES.
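
Steps 1 to 4 above can be sketched in Python as follows. This is a minimal illustration rather than the method used by the spreadsheet: the function name and example data are hypothetical, R is taken to be the Pearson correlation, and the sample (N-1) standard deviation is assumed throughout.

```python
import statistics


def standardized_residuals(x, y):
    """Z-score x and y, compute Pearson R, and return raw residuals / their SD."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)  # sample SDs (N-1)
    xs = [(v - mean_x) / sd_x for v in x]  # step 1: x(s)
    ys = [(v - mean_y) / sd_y for v in y]  # step 1: Y(s)
    # Step 2: Pearson R is the sum of z-score products divided by N-1.
    r = sum(a * b for a, b in zip(xs, ys)) / (n - 1)
    # Step 3: raw residuals Y(si) - R x(si); their mean is zero.
    raw = [b - r * a for a, b in zip(xs, ys)]
    # Step 4: divide by the standard deviation of the raw residuals.
    sd_raw = statistics.stdev(raw)
    return [e / sd_raw for e in raw]


# Hypothetical data: y equals x except for one inflated value at x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 15, 6, 7, 8, 9]
z = standardized_residuals(x, y)
flagged = [i for i, v in enumerate(z) if abs(v) > 2]  # indices of candidate outliers
```

With these made-up values only the inflated observation exceeds the cut-off of 2 in absolute value.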

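SE_RES equals s Sqrt[1 - h(ii)], where s equals Sqrt[Sum i (Y(si) - R x(si))^2 / (N-2)] for N observations and h(ii) equals 1/N + x(si)^2 / Sum i x(si)^2. Note that s is based on the squared raw residuals, since the raw residuals themselves sum to zero.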
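
As a rough sketch of step 5, the Python below divides each raw residual by SE_RES = s Sqrt[1 - h(ii)]. It assumes that s is the usual residual standard deviation, i.e. the square root of the sum of squared raw residuals divided by N-2, and that R is the Pearson correlation; the function name and data are hypothetical and this is not the spreadsheet's implementation.

```python
import math
import statistics


def studentized_residuals(x, y):
    """Return raw residuals divided by s * sqrt(1 - h_ii) for a one-predictor regression."""
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)  # sample SDs (N-1)
    xs = [(v - mean_x) / sd_x for v in x]
    ys = [(v - mean_y) / sd_y for v in y]
    r = sum(a * b for a, b in zip(xs, ys)) / (n - 1)  # Pearson R
    raw = [b - r * a for a, b in zip(xs, ys)]
    # Assumed definition of s: root of the residual sum of squares over N-2.
    s = math.sqrt(sum(e * e for e in raw) / (n - 2))
    sum_x2 = sum(a * a for a in xs)
    result = []
    for a, e in zip(xs, raw):
        h = 1 / n + a * a / sum_x2  # leverage h(ii) of observation i
        result.append(e / (s * math.sqrt(1 - h)))
    return result


# Hypothetical data: y equals x except for one inflated value at x = 5.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 15, 6, 7, 8, 9]
t = studentized_residuals(x, y)
flagged = [i for i, v in enumerate(t) if abs(v) > 2]  # indices of candidate outliers
```

Observations far from the mean of the predictor have larger h(ii) and hence a smaller SE_RES, which is why the same raw residual counts for more at the extremes of x.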

Studentized residuals may be evaluated using this spreadsheet.

Outliers without adjusting for other variables

When we are interested in outliers of a single variable, unadjusted for any others, the studentized residual is approximately equal to the standardized residual (i.e. a z-score) for large N.

Here h(ii) equals 1/N and s is the standard deviation of Y (SD), since the predicted value for Y is simply its mean.

It follows that SE_RES = s Sqrt[1 - h(ii)] = SD Sqrt[1 - 1/N] = SD Sqrt[(N-1)/N].

The studentized residual is therefore equal to (Y - mean(Y))/(SD Sqrt[(N-1)/N]), which is approximately (Y - mean(Y))/SD when N is large.
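
The size of this approximation can be checked numerically. The sketch below, with a hypothetical function name and made-up data, computes both the z-score and the studentized value using SE_RES = SD Sqrt[(N-1)/N], assuming SD is the sample (N-1) standard deviation.

```python
import math
import statistics


def z_and_studentized(values):
    """Return (z-score, studentized value) pairs for a single unadjusted variable."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (N-1), an assumption
    se_res = sd * math.sqrt((n - 1) / n)  # s * sqrt(1 - h_ii) with h_ii = 1/N
    return [((v - mean) / sd, (v - mean) / se_res) for v in values]


# Hypothetical data: 100 equally spaced values.
pairs = z_and_studentized(list(range(100)))
# Every studentized value is the z-score multiplied by sqrt(N / (N-1)).
inflation = math.sqrt(100 / 99)
```

The two quantities differ only by the constant factor Sqrt[N/(N-1)], which is about 1.005 for N = 100, so the same observations are flagged by either measure unless N is small.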

None: FAQ/studentres (last edited 2016-01-19 11:23:02 by PeterWatson)