How should I deal with skew when doing correlations?

Skewness (where the data are bunched at one end, e.g. because of ceiling or floor effects) and, in particular, outliers can give spurious Pearson correlations.

To analyse the effects of skew properly, look at the residuals from a regression that uses one of the variables as a predictor of the other. If the residuals are not normally distributed about zero, the Pearson correlation could be unreliable. This can be checked by plotting the residuals - see the regression talk at StatsCourse2006.
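As a rough illustration of the residual check described above, the following sketch fits a simple least-squares line and summarises the shape of the residuals. It uses only the Python standard library; the helper names (`ols_residuals`, `sample_skewness`) and the simulated data are assumptions made for this example, and the skewness figure is only a numerical companion to a plot, not a formal test of normality.

```python
import random
import statistics


def ols_residuals(x, y):
    """Residuals from a least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def sample_skewness(values):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data: a normal predictor and an outcome with right-skewed noise
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.5 * xi + rng.expovariate(1.0) for xi in x]

resid = ols_residuals(x, y)
print("mean residual:", round(statistics.fmean(resid), 6))
print("residual skewness:", round(sample_skewness(resid), 2))
```

A residual skewness well away from zero, or a histogram or Q-Q plot of `resid` that departs clearly from a normal shape, would suggest treating the Pearson correlation with caution.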

A suggested strategy is to transform one of the two variables using a power transform. If the residuals are still non-normal after that, either use a rank transform (Spearman's rho or Kendall's tau-b) or compute Normal scores after separately ranking each of the two variables which are to be correlated (Bishara and Hittner, 2012).
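The rank and Normal-score options can be sketched as follows, using only the Python standard library. Spearman's rho is computed as the Pearson correlation of the ranks, and the Normal scores are obtained by converting each rank to a proportion and passing it through the inverse normal distribution. The `(rank - 0.5) / n` offset is one common choice and is an assumption of this sketch, as are the helper names and the simulated data.

```python
import math
import random
import statistics
from statistics import NormalDist


def average_ranks(values):
    """Ranks from 1 to n, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def normal_scores(values):
    """Rank each value, then map the rank to a standard normal quantile."""
    n = len(values)
    return [NormalDist().inv_cdf((r - 0.5) / n) for r in average_ranks(values)]


# Hypothetical data: y is a right-skewed (log-normal style) function of x
rng = random.Random(2)
x = [rng.gauss(0, 1) for _ in range(100)]
y = [math.exp(xi + rng.gauss(0, 1)) for xi in x]

print("Pearson on raw data:     ", round(pearson(x, y), 3))
print("Spearman (Pearson/ranks):", round(pearson(average_ranks(x), average_ranks(y)), 3))
print("Pearson on Normal scores:", round(pearson(normal_scores(x), normal_scores(y)), 3))
```

Each variable is ranked on its own before the scores are computed, so the transformation does not use any information about the pairing of x and y.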

de Winter, Gosling and Potter (2016) suggest using the Pearson correlation for 'light-tailed' distributions and the Spearman correlation for heavier-tailed distributions, e.g. when outliers are present.
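A small made-up example may help show why the two coefficients can disagree when an outlier is present. The ten base points below are only weakly related; one extreme point is then appended. The data and helper names are invented for illustration, and the simple `ranks` helper assumes there are no tied values.

```python
def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranks(values):
    """Ranks from 1 to n; this toy version assumes no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Ten weakly related points (hypothetical data)
base_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
base_y = [6, 2, 9, 4, 1, 10, 5, 8, 3, 7]

# The same points plus a single extreme observation
out_x = base_x + [100]
out_y = base_y + [100]

print("Pearson,  no outlier:  ", round(pearson(base_x, base_y), 3))
print("Pearson,  with outlier:", round(pearson(out_x, out_y), 3))
print("Spearman, no outlier:  ", round(spearman(base_x, base_y), 3))
print("Spearman, with outlier:", round(spearman(out_x, out_y), 3))
```

In this example the single extreme point pushes the Pearson correlation close to 1, while the Spearman correlation rises much less because the outlier only counts as the highest rank.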

Outliers should not be deleted unless there is some measurement problem (Langkjaer-Bain, 2017).

Further Discussion

Bishara, A. J. and Hittner, J. B. (2012) Testing the Significance of a Correlation With Nonnormal Data: Comparison of Pearson, Spearman, Transformation, and Resampling Approaches. Psychological Methods 17(3) 399-417.

de Winter, J. C. F., Gosling, S. D. and Potter, J. (2016) Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: a tutorial using simulations and empirical data. Psychological Methods 21(3) 273-290.

Dunlap, W. P., Burke, M. J. and Greer, T. (1995) The effect of skew on the magnitude of product-moment correlations. Journal of General Psychology 122 365-377.

Langkjaer-Bain, R. (2017) The murky tale of Flint's deceptive water data. Significance 14(2) 17-21.
