An alternative to the kappa statistic for estimating rater reliability is the intraclass correlation coefficient (ICC), which is computed from analysis of variance output.
For a repeated measures ANOVA involving k raters, and assuming raters are a fixed effect and subjects are a random effect, it follows that
ICC1 = [MS(subjects)-MS(subjects x raters)]/[MS(subjects) + (k-1)MS(subjects x raters)]
where MS is the mean square from the repeated measures analysis of variance.
This corresponds to the (single measures) two-way mixed (and two-way random) ICC in SPSS with a type of consistency (see here).
Unlike the Pearson correlation, the intraclass correlation can be used when each subject has three or more ratings rather than just a pair. Einfeld and Tonge (1992, p 12) prefer the ICC to the Pearson correlation because it is more conservative, owing to the fact that it "takes account of the absolute as well as the relative difference between the scores of two raters".
Howell (1997) also recommends an alternative, more widely used ICC which assumes that the raters are a random sample from a larger population. It has the form:
ICC2 = [MS(subjects) - MS(subjects x raters)]/[MS(subjects) + (k-1)MS(subjects x raters) + k[MS(raters) - MS(subjects x raters)]/n]
where n is the number of subjects being rated. ICC2 agrees with the (single measures) two-way random (and two-way mixed) ICCs with a type of absolute agreement.
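The two formulae above can be written as small functions that take the mean squares from the ANOVA table. The Python sketch below is illustrative only: the function names are hypothetical and the mean squares are made-up values for k = 2 raters and n = 3 subjects, not SPSS output.

```python
def icc_consistency(ms_subjects, ms_error, k):
    """ICC1: single-measures ICC of type consistency.

    ms_error is MS(subjects x raters); k is the number of raters.
    """
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)


def icc_absolute(ms_subjects, ms_raters, ms_error, k, n):
    """ICC2: single-measures ICC of type absolute agreement.

    n is the number of subjects being rated.
    """
    denominator = (ms_subjects + (k - 1) * ms_error
                   + k * (ms_raters - ms_error) / n)
    return (ms_subjects - ms_error) / denominator


# Hypothetical mean squares for 2 raters and 3 subjects.
icc1 = icc_consistency(ms_subjects=4.5, ms_error=0.5, k=2)
icc2 = icc_absolute(ms_subjects=4.5, ms_raters=6.0, ms_error=0.5, k=2, n=3)
print(round(icc1, 2), round(icc2, 2))
```

The two functions share the same numerator; ICC2 differs only in the extra rater term in the denominator, which grows as the raters' mean ratings move apart.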
ICC2, unlike ICC1, penalises systematic differences between raters because it assesses absolute inter-rater agreement, and so gives an appropriately lower value when raters differ in level. For example, if two raters rate three subjects, giving ratings of 1,2; 2,4 and 3,6 respectively, then ICC1 = 0.80 (consistency, two-way random) and ICC2 = 0.46 (absolute agreement, two-way random).
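The worked example can be checked from the raw ratings. The following Python sketch assumes a complete subjects-by-raters table with no missing cells; the helper name two_way_mean_squares is hypothetical. It derives the three mean squares from the usual sums of squares and then applies the two formulae.

```python
def two_way_mean_squares(ratings):
    """Return (MS subjects, MS raters, MS subjects x raters).

    ratings has one row per subject and one column per rater.
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    # With one rating per cell, the interaction is the residual (error) term.
    ss_error = ss_total - ss_subjects - ss_raters

    return (ss_subjects / (n - 1),
            ss_raters / (k - 1),
            ss_error / ((n - 1) * (k - 1)))


ratings = [[1, 2], [2, 4], [3, 6]]  # rows: subjects, columns: raters
n, k = len(ratings), len(ratings[0])
ms_s, ms_r, ms_e = two_way_mean_squares(ratings)

icc1 = (ms_s - ms_e) / (ms_s + (k - 1) * ms_e)
icc2 = (ms_s - ms_e) / (ms_s + (k - 1) * ms_e + k * (ms_r - ms_e) / n)
print(ms_s, ms_r, ms_e)                  # 4.5 6.0 0.5
print(round(icc1, 2), round(icc2, 2))    # 0.8 0.46
```

The second rater always gives exactly twice the first rater's score, so the rank order of subjects is preserved (high consistency) while the actual scores disagree (lower absolute agreement).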
The ICC may be computed in SPSS using Analyze > Scale > Reliability Analysis > Statistics, choosing one of the two models (two-way random or two-way mixed) which allow a type of absolute agreement, and reading off the single measures value. The consistency option is to be avoided as it does not take account of differences in the level of ratings between raters.
Examples of ICC computation in SPSS are available here and here. ICC1 above corresponds to sffixed (ICC(3,1)) and ICC2 to sfrandom (ICC(2,1)) in Shrout and Fleiss (1979); both are also stated as the above formulae in SAS code here.
The ICC can be viewed as an estimate of the ratio
[true inter-rater variance]/[true inter-rater variance + common error in rating variance]
which is mentioned as a reliability correlation in the two-rater case in, for example, a paper by Martin Bland and Doug Altman.
An overview of approaches to inter-rater reliability, including the ICC, is given by Darroch and McCloud (1986).
ICCs below 0.40 are regarded as poor, those from 0.40 to 0.59 as fair and those of 0.60 or above as good (Cicchetti and Sparrow, 1981). Kramer and Feinstein (1981) also give rules of thumb for sizes of ICCs.
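These cut-offs are easy to apply in code. The Python sketch below uses a hypothetical helper, describe_icc, and assumes that a value of exactly 0.60 counts as good.

```python
def describe_icc(icc):
    """Label an ICC using the cut-offs quoted above (0.40 and 0.60).

    Assumption: the boundary value 0.60 is treated as 'good'.
    """
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    return "good"


for value in (0.46, 0.80):
    print(value, describe_icc(value))
```

Applied to the worked example, the absolute agreement ICC of 0.46 is labelled fair and the consistency ICC of 0.80 is labelled good.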
Intraclass correlations can be used to assess agreement between raters and also agreement between ratings within the same participant (ICCs at the individual level).
SPSS syntax to perform generalizability analyses is given by Mushquash and O'Connor (2006). These analyses (Brennan, 2003) are similar to, though not the same as, ICCs: they generate a coefficient which is the ratio of the estimated variance component for raters to the sum of the variance component for raters and the component for the interaction of raters and ratees (error).
Brennan RL (2003) Coefficients and indices in generalizability theory. Centre for Advanced Studies in Measurement and Assessment, CASMA Research Report 1 1-44.
Cicchetti DV and Sparrow SA (1981) Developing criteria for establishing interrater reliability of specific items: applications to assessment of adaptive behavior. Am J Ment Defic 86 127-37.
Darroch JN and McCloud PI (1986) Category distinguishability and observer agreement. Australian Journal of Statistics 28 371-88.
Einfeld SL and Tonge BJ (1992) Manual for the developmental behaviour checklist (DBC) (Primary Carer version). Melbourne: School of Psychiatry, University of New South Wales, and Centre for Developmental Psychiatry, Monash University, Clayton, Victoria.
Howell DC (1997) Statistical methods for psychology. Fourth edition. Wadsworth: Belmont, CA. (pages 490-493).
Kramer MS and Feinstein AR (1981) The biostatistics of concordance. Clinical Pharmacology and Therapeutics 29 111-123.
McGraw KO and Wong SP (1996) Forming inferences about some intraclass correlation coefficients. Psychological Methods 1 30-46. (Correction, 1, 390). This paper recommends using two-way mixed model effects to obtain ICCs for ratings with the people rated treated as a random factor.
Mushquash C and O'Connor BP (2006) SPSS and SAS programs for generalizability theory analyses. Behavior Research Methods 38(3) 542-547.
Shrout PE and Fleiss JL (1979) Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 86(2) 420-428. (A good primer showing how ANOVA output can be used to compute ICCs).