Illustration of the inferiority of a Pearson correlation, as opposed to an ICC, for assessing absolute agreement between raters

The Intraclass Correlation Coefficient (ICC) is the appropriate measure for assessing interrater reliability, whether there are two raters or more than two. As the reviewer observes, the ICC derives from an analysis of variance of the ratings, and contrasts the variance due to subjects with that due to raters. A high ICC indicates that ratings differ mostly because they apply to different people, and relatively little because they were given by different raters; that is, it indicates that the ratings are consistent, or reliable, across raters.
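
For concreteness, one standard formulation, the two-way random-effects ICC for absolute agreement between single ratings (often written ICC(2,1) or ICC(A,1); the explicit formula is supplied here for illustration, as it is not given in the text above), is

\mathrm{ICC}(2,1) = \frac{MS_S - MS_E}{MS_S + (k-1)\,MS_E + \frac{k}{n}\,(MS_R - MS_E)},

where MS_S, MS_R and MS_E are the subject, rater and residual mean squares from the two-way ANOVA, n is the number of subjects and k the number of raters. The coefficient approaches 1 only when the subject variance dominates both the rater and the residual variance.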

In contrast, Pearson’s correlation coefficient (r) measures the strength of linear association between two variables. If there were two raters and one gave exactly half the score that the other gave to each subject, the Pearson correlation between the raters’ scores would be 1, because a perfect linear association (rating 2 = 0.5 × rating 1) exists between them. The ICC, however, would be very low, because a large proportion of the difference between the ratings would be due to the different raters. Since the actual state of affairs is a serious lack of agreement, the ICC would give the right message and Pearson’s r the wrong one.
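
A minimal numerical sketch of this scenario (hypothetical scores, not taken from any study; assumes Python with NumPy) makes the point concrete: the two sets of scores are perfectly linearly related, so Pearson’s r is exactly 1, while the absolute-agreement ICC computed from the two-way ANOVA mean squares defined above is close to zero.

    import numpy as np

    # Hypothetical scores: rater 2 gives exactly half of rater 1's score for every subject.
    rater1 = np.array([80.0, 85.0, 90.0, 95.0, 100.0])
    rater2 = 0.5 * rater1

    # Pearson's r: the linear association is perfect, so r = 1 despite the disagreement.
    r = np.corrcoef(rater1, rater2)[0, 1]

    # Absolute-agreement ICC for a single rating, ICC(2,1), from two-way ANOVA mean squares.
    X = np.column_stack([rater1, rater2])          # n subjects x k raters
    n, k = X.shape
    grand = X.mean()
    ms_s = k * np.sum((X.mean(axis=1) - grand) ** 2) / (n - 1)   # subjects mean square
    ms_r = n * np.sum((X.mean(axis=0) - grand) ** 2) / (k - 1)   # raters mean square
    ms_e = (np.sum((X - grand) ** 2)
            - (n - 1) * ms_s - (k - 1) * ms_r) / ((n - 1) * (k - 1))  # residual mean square
    icc = (ms_s - ms_e) / (ms_s + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

    print(f"Pearson r = {r:.3f}")    # 1.000
    print(f"ICC(2,1)  = {icc:.3f}")  # about 0.03 for these scores

The exact ICC depends on the scores chosen, but whenever one rater’s scores are a constant fraction of the other’s, the rater variance inflates the denominator of the ICC and drags it far below the Pearson correlation.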

When the ICC is high there is strong agreement between the raters’ scores, and consequently a strong linear association between them. The converse, as illustrated above, is not true.

The ICC, therefore, takes account of the absolute as well as the relative differences between the scores of two raters (Einfeld & Tonge, 1992, p. 12) and is the preferred statistic for computing inter-rater reliability (Streiner & Norman, 1995; Khan & Chien, 2001).

References

Einfeld, S. L., & Tonge, B. J. (1992). Manual for the Developmental Behaviour Checklist (DBC) (Primary Carer version). Melbourne: School of Psychiatry, University of New South Wales, and Centre for Developmental Psychiatry and Psychology, Monash University, Clayton, Victoria.

Khan, K. S., & Chien, P. F. W. (2001). Evaluation of a clinical test. I: Assessment of reliability. British Journal of Obstetrics and Gynaecology, 108, 562-567.

Rousson, V., Gasser, T., & Seifert, B. (2002). Assessing intrarater, interrater and test-retest reliability of continuous measurements. Statistics in Medicine, 21, 3431-3446.

Streiner, D. L., & Norman, G. R. (1995). Health measurement scales: A practical guide to their development and use. Oxford: Oxford University Press. (A third edition, with the same authors, title and publisher, was published in 2003.)