How do I compute Akaike's and Bayesian information criteria (AIC, BIC) to compare regression models and how do I interpret them?

Akaike's information criterion (AIC) is used to compare the efficiency of multivariate models fitted to the same data by combining the degree of fit with the number of terms in the model. Better-fitting, simpler models have smaller AICs and are preferred. AIC can be used as an alternative to the F ratio in stepwise regressions to investigate the effect of adding or removing one or more predictors from a model (see an example in the Regression Grad talk). Information criteria can also be used to compare logistic regression models with overdispersion (Agresti, 1996).

AIC = n ln(RSS/n) + 2 df(model)

where RSS is the residual sum of squares, which is routinely output by the regression analysis, n is the total sample size and df(model) is the degrees of freedom of the regression model, i.e. the number of parameters, equal to the number of predictors + 1 (for the intercept). This formula for AIC is also given on page 63 of Burnham and Anderson (2002).
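As a minimal sketch of the AIC formula above (the function name and the numbers are hypothetical and chosen only for illustration), the calculation can be written in Python as:

```python
import math

def aic_from_rss(rss: float, n: int, n_predictors: int) -> float:
    """AIC = n ln(RSS/n) + 2 df(model), with df(model) = predictors + 1."""
    df_model = n_predictors + 1  # +1 for the intercept
    return n * math.log(rss / n) + 2 * df_model

# Hypothetical RSS values for two models fitted to the same 50 observations.
aic_small = aic_from_rss(rss=120.0, n=50, n_predictors=1)
aic_large = aic_from_rss(rss=118.0, n=50, n_predictors=3)

# The model with the smaller AIC is preferred.
preferred = "small" if aic_small < aic_large else "large"
print(round(aic_small, 2), round(aic_large, 2), preferred)
```

In this made-up example the two extra predictors reduce the RSS only slightly, so the penalty term outweighs the gain in fit and the simpler model has the smaller AIC.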

There is also a Bayesian Information Criterion (BIC), or Schwarz's criterion:

BIC = n ln(RSS/n) + k ln(n)

where n is the total sample size and k is the number of parameters (including the intercept). The BIC may be used as a form of Bayes Factor (see for example here, and also here for a comparison of logistic regression models).
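The sketch below computes the BIC from the residual sum of squares, assuming the commonly used least-squares form BIC = n ln(RSS/n) + k ln(n), which is on the same scale as the AIC formula given earlier. The function name and the numbers are hypothetical.

```python
import math

def bic_from_rss(rss: float, n: int, k: int) -> float:
    """BIC = n ln(RSS/n) + k ln(n); k counts all parameters incl. the intercept."""
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical RSS values for two models fitted to the same 50 observations.
bic_small = bic_from_rss(rss=120.0, n=50, k=2)  # intercept + 1 predictor
bic_large = bic_from_rss(rss=118.0, n=50, k=4)  # intercept + 3 predictors

# Each extra parameter costs ln(n) under BIC, compared with 2 under AIC,
# so BIC penalises model size more heavily once n is 8 or more.
print(round(bic_small, 2), round(bic_large, 2))
```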

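Nagin (1999) suggests using bij = exp(BIC(i) - BIC(j)) to decide whether the BICs of two models, i and j, differ meaningfully (see page 147, and Table 2 on page 148 for some rules of thumb). The approach is also described in Chapter Four of Nagin (2005). Note that Nagin works with the BIC on the log-likelihood scale, where the larger (less negative) BIC indicates the better model, so that bij > 1 favours model i.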

bij		Interpretation
bij < 1/10		Strong evidence for model j
1/10 < bij < 1/3		Moderate evidence for model j
1/3 < bij < 1		Weak evidence for model j
1 < bij < 3		Weak evidence for model i
3 < bij < 10		Moderate evidence for model i
bij > 10		Strong evidence for model i
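The table above can be turned into a small lookup. This is a sketch only: the function names and BIC values are hypothetical, the BICs are assumed to be on the log-likelihood scale used in Nagin's work (larger is better, as far as we are aware), and the handling of values falling exactly on a boundary (1/10, 1/3, 1, 3, 10) is our own assumption because the table uses strict inequalities.

```python
import math

def nagin_bij(bic_i: float, bic_j: float) -> float:
    """bij = exp(BIC(i) - BIC(j))."""
    return math.exp(bic_i - bic_j)

def interpret_bij(bij: float) -> str:
    """Map bij onto the rules of thumb in the table above."""
    if bij <= 0:
        raise ValueError("bij must be positive")
    if bij == 1:
        return "No preference"
    # Values below 1 favour model j; work with the reciprocal in that case.
    favoured, ratio = ("i", bij) if bij > 1 else ("j", 1 / bij)
    if ratio > 10:
        strength = "Strong"
    elif ratio > 3:
        strength = "Moderate"
    else:
        strength = "Weak"  # boundary values fall into the weaker band (assumption)
    return f"{strength} evidence for model {favoured}"

# Hypothetical BICs: exp(-1.5) is about 0.22, i.e. between 1/10 and 1/3.
bij = nagin_bij(bic_i=-1002.0, bic_j=-1000.5)
print(round(bij, 3), interpret_bij(bij))
```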

From here, a raw difference of more than 10 between a pair of BICs is regarded as a meaningful difference in model fit for BICs obtained in structural equation models (see Raftery (1995)).

Jones, Nagin and Roeder (2001) alternatively suggest using twice the raw difference in BICs to compare models.

2(Diff in BICs)		Interpretation
0 to 2		Not worth mentioning
2 to 6		Positive
6 to 10		Strong
> 10		Very Strong
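A minimal sketch of this rule of thumb follows. The function name and the BIC values are hypothetical, and assigning boundary values (2, 6, 10) to the lower band is our own assumption since the table does not say which band they belong to.

```python
def interpret_twice_bic_diff(bic_a: float, bic_b: float):
    """Return 2 * |BIC difference| and its label from the table above."""
    twice_diff = 2 * abs(bic_a - bic_b)
    if twice_diff <= 2:
        label = "Not worth mentioning"
    elif twice_diff <= 6:
        label = "Positive"
    elif twice_diff <= 10:
        label = "Strong"
    else:
        label = "Very Strong"
    return twice_diff, label

# Hypothetical BICs differing by 1.5, so twice the difference is 3.
result = interpret_twice_bic_diff(-1000.5, -1002.0)
print(result)
```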

On a related note, Shafer and Jeffreys (1961) each give rules of thumb for the size of Bayes Factors (which compare an alternative model to a null model): Bayes Factors under 3 are described as weak by Shafer and as anecdotal by Jeffreys.

Some rules of thumb for using Bayes factors (Jeffreys 1961)

1 < Bayes factor <= 3		weak evidence for M1
3 < Bayes factor <= 10		substantial evidence for M1
10 < Bayes factor <= 100		strong evidence for M1
100 < Bayes factor		decisive evidence for M1

Kass and Raftery (1995) proposed slightly different rules of thumb for evaluating the size of Bayes Factors: 1 to 3 (not worth more than a bare mention), 3 to 20 (positive), 20 to 150 (strong) and > 150 (very strong). These have also been used with R code to compare the equality of group variances. There are also some rules of thumb for Bayes Factors on page 10 of the presentation here.
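The two sets of thresholds quoted above (Jeffreys 1961; Kass and Raftery 1995) can be compared side by side with a short lookup. This is a sketch: the function and scheme names are hypothetical, upper bounds are treated as inclusive (an assumption for the Kass and Raftery bands), and Bayes Factors below 1 are handled by taking the reciprocal and reporting evidence for the null model M0 instead.

```python
def classify_bayes_factor(bf: float, scheme: str = "jeffreys"):
    """Return (label, favoured model) for a Bayes factor of M1 against M0."""
    if bf <= 0:
        raise ValueError("A Bayes factor must be positive")
    schemes = {
        "jeffreys": [(3, "weak"), (10, "substantial"),
                     (100, "strong"), (float("inf"), "decisive")],
        "kass_raftery": [(3, "not worth more than a bare mention"),
                         (20, "positive"), (150, "strong"),
                         (float("inf"), "very strong")],
    }
    if bf == 1:
        return ("no evidence either way", None)
    favoured = "M1"
    if bf < 1:
        # Evidence favours M0; classify the reciprocal instead.
        bf, favoured = 1 / bf, "M0"
    for upper, label in schemes[scheme]:
        if bf <= upper:
            return (label, favoured)

# The same Bayes factor can receive different labels under the two schemes.
print(classify_bayes_factor(15, "jeffreys"))
print(classify_bayes_factor(15, "kass_raftery"))
```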

Bayes Factors can be combined. For example, suppose we have three conditions A, B and C and wish to test whether the mean of B is nearer to the mean of A or to the mean of C. We could perform two one-sample t-tests on the differences B-A and C-B, obtain a Bayes Factor (BF) for each test and take the ratio BF(B-A)/BF(C-B). A ratio above 1 suggests the mean of B is closer to the mean of C, and a ratio below 1 suggests the mean of B is closer to the mean of A.
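As a small worked example of this ratio (the Bayes Factor values are hypothetical and would in practice come from software such as JASP or the R BayesFactor procedure; the function name is also our own):

```python
def compare_closeness(bf_b_minus_a: float, bf_c_minus_b: float):
    """Ratio BF(B-A)/BF(C-B) and the conclusion it suggests."""
    ratio = bf_b_minus_a / bf_c_minus_b
    if ratio > 1:
        verdict = "mean of B closer to mean of C"
    elif ratio < 1:
        verdict = "mean of B closer to mean of A"
    else:
        verdict = "no preference"
    return ratio, verdict

# Hypothetical values: good evidence that B differs from A (BF = 12) and
# little evidence that C differs from B (BF = 0.5).
ratio, verdict = compare_closeness(bf_b_minus_a=12.0, bf_c_minus_b=0.5)
print(ratio, verdict)
```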

Free Bayesian analysis software (JASP), which acts as a front end to the R BayesFactor procedure, is available from here.
Jeon M and De Boeck P (2017) compare translational approaches, finding that a p-value of 0.01 is roughly equivalent to a Bayes Factor of 3, and refute earlier work linking this Bayes Factor to a p-value of 0.05 (see below and Dienes Z. (2014)).
Simulations with R code for a Bayesian power analysis are available, with details here if the link is broken. A t-test Bayesian power simulation is here, reproduced here if the link is broken.
A web calculator converting a Bayes Factor into a Cohen's d is here. Note, however, that there are no established procedures for computing power for a Bayes Factor analysis. Indeed, Dienes (2014) has argued that Bayes Factors obviate the need to perform power analyses.
A note about the choice of priors for a one-sample binomial test.

There is also a pdf guide to computing and interpreting Bayes Factors from JASP (software) in factorial ANOVAs here. In particular, pages 28-31 of this guide show how to compare pairs of Bayes Factors representing different models fitted to the same data, e.g. with and without a main effect, to assess the importance of the extra terms (e.g. the main effect) in the fuller model. Our experience of fitting these models suggests that there is close agreement in inference between pairwise comparisons of these Bayes Factors and classical maximum likelihood approaches such as the F test. In other words, if group means are found to differ using the F test they will also be seen to differ when comparing pairs of Bayes Factors, and vice versa. Similarly, if there is no evidence of differences in group means using the F test, there will be a similar lack of evidence when comparing the Bayes Factors.

The 2017 issue of Psychological Methods (Volume 22, No. 2) is entirely devoted to describing and illustrating applications of Bayesian methods. These include the evaluation of Bayes Factors in hierarchical analysis of variance using the BayesFactor procedure in R, the comparison of Bayes Factors equal to 3 with p-values (Jeon M and De Boeck P (2017) find that these approximately correspond to a p-value of 0.01) and the comparison of model selection criteria in factor analysis, with the BIC performing well. Jeon M and De Boeck P (2017) further suggest that the traditional view that a Bayes Factor of 3 corresponds to a p-value of 0.05 (as suggested, for example, here, or here if the link is broken) may be incorrect, but that a Bayes Factor of 3 may still be a good cut-off for suggesting rejection of the null hypothesis.

Halsey (2019) presents an overview of reporting p-values, Bayes Factors, effect sizes and their confidence intervals. If this link is broken a pdf copy of Halsey's paper is here.

An on-line web calculator (Rouder et al., 2009), which like JASP above uses the BayesFactor procedure in R, converts t values to Bayes Factors. This calculator is available to use here. There is also a companion calculator here for obtaining Bayes Factors from regression coefficients (Liang et al., 2008).

Note that there is also a second-order AIC, often written AICc (Sugiura 1978, Hurvich and Tsai 1991), which is recommended for small sample sizes.
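A sketch of the small-sample correction, assuming the usual form AICc = AIC + 2k(k+1)/(n-k-1), where k is the number of estimated parameters. Texts differ on whether k should also count the error variance in a least-squares regression, so the value of k used here is an assumption, and the function name and numbers are hypothetical.

```python
def aicc(aic: float, n: int, k: int) -> float:
    """Second-order AIC: AIC plus a correction that vanishes as n grows."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical AIC of 10 with k = 2 parameters: the correction is
# 12/17 for n = 20 but only 12/1997 for n = 2000.
small_sample = aicc(10.0, n=20, k=2)
large_sample = aicc(10.0, n=2000, k=2)
print(small_sample, large_sample)
```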

References

Agresti A (1996) An introduction to categorical data analysis. Wiley: New York.

Burnham KP and Anderson DR (2002) Model selection and multimodel inference: a practical information-theoretic approach, 2nd ed. Springer-Verlag: New York.

(A pdf copy of the above book may also be downloaded for free from here.)

Dienes Z. (2014) Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5: 781.

Halsey LG (2019) The reign of the p-value is over: what alternative analyses could we employ to fill the power vacuum? Biol. Lett. 15 20190174.

Hurvich CM and Tsai C-L (1991) Bias of the corrected AIC criterion for underfitted regression and time series models. Biometrika 78 499–509.

Jeffreys H (1961) Theory of Probability, 3rd ed. Oxford Classic Texts in the Physical Sciences. Oxford Univ. Press: Oxford.

Jeon M and De Boeck P (2017) Decision qualities of Bayes factor and p value-based hypothesis testing. Psychological Methods 22(2) 340-360.

Jones B, Nagin D & Roeder K (2001) A SAS Procedure Based on Mixture Models for Estimating Developmental Trajectories. Sociological Methods & Research 29 374-393.

Kass RE & Raftery AE (1995) Bayes factors. Journal of the American Statistical Association 90 773-795.

Liang F, Paulo R, Molina G, Clyde MA and Berger JO (2008), Mixtures of g Priors for Bayesian Variable Selection. Journal of the American Statistical Association 103, 410-423.

Nagin DS (1999) Analyzing Developmental Trajectories: A Semiparametric, Group-Based Approach. Psychological Methods 4(2) 139-157.

Nagin DS (2005) Group-based Modeling of Development. Harvard University Press: Massachusetts.

Raftery AE (1995). Bayesian Model Selection in Social Research. Sociological Methodology, 25, 111-163.

Rouder JN, Speckman PL, Sun D, Morey RD, & Iverson G (2009) Bayesian t-Tests for Accepting and Rejecting the Null Hypothesis. Psychonomic Bulletin & Review 16 225-237.

Sugiura N (1978) Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics: Theory and Methods A7 13–2