Synopsis of the Graduate Statistics Course 2007
The Anatomy of Statistics: Models, Hypotheses, Significance and Power
- Experiments, Data, Models and Parameters
- Probability vs. Statistics
- Hypotheses and Inference
- The Likelihood Function
- Estimation and Inferences
- Maximum Likelihood Estimate (MLE)
- Schools of Statistical Inference
- Ronald Aylmer FISHER
- Jerzy NEYMAN and Egon PEARSON
- Rev. Thomas BAYES
- R A Fisher: P values and Significance Tests
- Neyman and Pearson: Hypothesis Tests
- Type I & Type II Errors
- Size and Power
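
As a minimal illustration of size and power (in Python rather than the course's SPSS), the sketch below estimates both by simulation; the sample size, effect size and alpha are illustrative assumptions, not values from the course.

{{{#!python
# Estimate the size (Type I error rate) and power of a one-sample t-test
# by simulation. All settings here are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, n_sims = 20, 0.05, 10_000

def rejection_rate(true_mean):
    """Proportion of simulated samples in which H0: mu = 0 is rejected."""
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(loc=true_mean, scale=1.0, size=n)
        _, p = stats.ttest_1samp(x, popmean=0.0)
        rejections += p < alpha
    return rejections / n_sims

print("size  (mu = 0.0):", rejection_rate(0.0))   # close to alpha = 0.05
print("power (mu = 0.5):", rejection_rate(0.5))   # the test's power at this effect size
}}}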
Exploratory Data Analysis (EDA)
- What is it?
- Skew and kurtosis: definitions and magnitude rules of thumb
- Pictorial representations - in particular histograms, boxplots and stem and leaf displays
- Effect of outliers
- Power transformations
- Rank transformations
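
A short Python sketch of a few of the EDA ideas above (skew, kurtosis, and log/rank transformations); the right-skewed lognormal data are simulated purely for illustration.

{{{#!python
# Skewness and kurtosis before and after a log transform of
# right-skewed data, plus a rank transformation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)   # right-skewed sample

print("raw: skew=%.2f, excess kurtosis=%.2f" % (stats.skew(x), stats.kurtosis(x)))
x_log = np.log(x)                                  # one of the power transformations
print("log: skew=%.2f, excess kurtosis=%.2f" % (stats.skew(x_log), stats.kurtosis(x_log)))

ranks = stats.rankdata(x)                          # rank transformation
}}}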
Categorical Data Analysis
- The Naming of Parts
- Categorical Data
- Frequency Tables
- The Chi-Squared Goodness-of-Fit Test
- The Chi-squared Distribution
- The Binomial Test
- The Chi-squared test for association
- Simpson, Cohen and McNemar
- SPSS procedures that help
  - Frequencies
  - Crosstabs
  - Chi-square
  - Binomial
- Types of Data
  - Quantitative
  - Qualitative
    - Nominal
    - Ordinal
- Frequency Table
- Bar chart
- Cross-classification or Contingency Table
- Simple use of SPSS Crosstabs
- Goodness of Fit Chi-squared Test
- Chance performance and the Binomial Test
- Confidence Intervals for Binomial Proportions
- Pearson’s Chi-squared
- Yates’ Continuity Correction
- Fisher’s Exact Test
- Odds and Odds Ratios
- Log Odds and Log Odds ratios
- Sensitivity and Specificity
- Signal Detection Theory
- Simpson’s Paradox
- Measures of agreement: Cohen's Kappa
- Measures of change: McNemar's Test
- Association or Independence: Chi-squared test of association
- Comparing two or more classified samples
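
To make a few of these ideas concrete, here is a sketch of a 2x2 contingency-table analysis in Python (the course uses SPSS Crosstabs for this); the counts are invented for illustration.

{{{#!python
# Chi-squared test of association, Fisher's exact test and the odds
# ratio for a 2x2 table of made-up counts.
import numpy as np
from scipy import stats

table = np.array([[30, 10],    # row 1: group A, outcome yes / no
                  [15, 25]])   # row 2: group B, outcome yes / no

chi2, p, dof, expected = stats.chi2_contingency(table)  # applies Yates' correction for 2x2
print("chi2=%.2f, df=%d, p=%.4f" % (chi2, dof, p))

odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
print("odds ratio=%.2f, log odds ratio=%.2f" % (odds_ratio, np.log(odds_ratio)))

_, p_exact = stats.fisher_exact(table)                  # preferred when expected counts are small
print("Fisher's exact p=%.4f" % p_exact)
}}}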
Regression
- What is it?
- Expressing correlations (simple regression) in vector form
- Scatterplots
- Assumptions in regression
- Restriction of range of a correlation
- Comparing pairs of correlations
- Multiple regression
- Least squares
- Residual plots
- Stepwise methods
- Synergy
- Collinearity
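
A minimal least-squares sketch of multiple regression in Python; the data, coefficients and variable names here are simulated assumptions, not course data.

{{{#!python
# Multiple regression fitted by least squares, with residuals
# available for a residual plot.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates
residuals = y - X @ beta                       # what a residual plot would show
print("estimates:", np.round(beta, 2))         # close to [1.0, 2.0, -0.5]
}}}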
Between subjects analysis of variance
- What is it used for?
- Main effects
- Interactions
- Simple effects
- Plotting effects
- Implementation in SPSS
- Effect size
- Model specification
- Latin squares
- Balance
- Venn diagram depiction of sources of variation
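
As a sketch of a one-way between-subjects ANOVA (here via scipy rather than SPSS), with eta-squared as an effect size; group means and sizes are illustrative.

{{{#!python
# One-way between-subjects ANOVA plus eta-squared effect size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, scale=1.0, size=15) for m in (0.0, 0.0, 0.8)]

F, p = stats.f_oneway(*groups)
print("F=%.2f, p=%.4f" % (F, p))

# Effect size: eta-squared = SS_between / SS_total
allx = np.concatenate(groups)
ss_total = ((allx - allx.mean()) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - allx.mean()) ** 2 for g in groups)
print("eta-squared = %.3f" % (ss_between / ss_total))
}}}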
The General Linear Model and complex designs including Analysis of Covariance
- GLM and Simple Linear Regression
- The Design Matrix
- Least Squares
- ANOVA and GLM
- Types of Sums of Squares
- Multiple Regression as GLM
- Multiple Regression as a sequence of GLMs in SPSS
- The two Groups t-test as a GLM
- One-way ANOVA as GLM
- Multi-factor Model
- Additive (no interaction)
- Non-additive (interaction)
- Analysis of Covariance
  - Simple regression: 1 intercept, 1 slope
  - Parallel regressions: multiple intercepts, 1 slope
  - Non-parallel regressions: multiple intercepts, multiple slopes
- Sequences of GLMs in ANCOVA
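
A minimal sketch of the two-groups t-test written as a GLM y = Xb + e, showing that with dummy coding the fitted slope is the difference in group means; the data are simulated.

{{{#!python
# The two-group t-test as a GLM with a dummy-coded design matrix.
import numpy as np

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(0.0, 1.0, 20),   # group A scores
                    rng.normal(1.0, 1.0, 20)])  # group B scores
group = np.repeat([0, 1], 20)                   # dummy (0/1) coding

X = np.column_stack([np.ones(40), group])       # the design matrix
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept (= mean of A):   %.3f" % beta[0])
print("slope (= mean B - mean A): %.3f" % beta[1])
}}}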
Power analysis
- Hypothesis testing
- Boosting power
- Effect sizes: definitions, magnitudes
- Power evaluation methods: description and implementation using examples (a sketch follows this list)
  - nomogram
  - power calculators
  - SPSS macros
  - spreadsheets
  - power curves
  - tables
  - quick formula
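
As one example of a power calculator, the sketch below solves for the per-group sample size of an independent-groups t-test using statsmodels; the effect size and targets are illustrative conventions (Cohen's medium effect, 80% power), not course-prescribed values.

{{{#!python
# Solve for the sample size needed to detect a medium effect.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # Cohen's d (medium)
                                   alpha=0.05,
                                   power=0.80)
print("n per group needed: %.1f" % n_per_group)      # about 64
}}}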
Repeated Measures and Mixed Model ANOVA
- Two sample t-Test vs. Paired t-Test
- Repeated Measures as an extension of paired measures
- Single factor Within-Subject design
- Sphericity
- Two (or more) factors Within-Subject design
- Mixed designs combining Within- and Between-Subject factors
- Mixed Models, e.g. both Subjects & Items as Random Effects factors
- The ‘Language as Fixed Effects’ Controversy
- Testing for Normality
- Single degree of freedom approach
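
A small sketch of why repeated measures matter: the same simulated scores analysed as independent groups versus as paired measures on the same subjects. The subject-effect and condition-effect sizes are illustrative assumptions.

{{{#!python
# Paired vs independent analysis of within-subject data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
baseline = rng.normal(scale=2.0, size=15)             # per-subject level
cond1 = baseline + rng.normal(scale=0.5, size=15)
cond2 = baseline + 0.5 + rng.normal(scale=0.5, size=15)

_, p_ind = stats.ttest_ind(cond1, cond2)   # wrongly treats scores as independent groups
_, p_rel = stats.ttest_rel(cond1, cond2)   # paired: removes between-subject variance
print("independent-groups p=%.4f, paired p=%.4f" % (p_ind, p_rel))
}}}

The paired test is typically far more sensitive here because the large between-subject variability is removed from the error term.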
Latent variable modelling – factor analysis and all that!
- Path diagrams – a regression example
- Comparing correlations
- Exploratory factor analysis
- Assumptions of factor analysis
- Reliability testing (Cronbach’s alpha)
- Fit criteria in exploratory factor analysis
- Rotations
- Interpreting factor loadings
- Confirmatory factor models
- Fit criteria in confirmatory factor analysis
- Equivalence of correlated and uncorrelated models
- Cross validation as a means of assessing fit for different models
- Parsimony: determining the most important items in a factor analysis
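
A minimal sketch of reliability testing via Cronbach's alpha, computed from its standard definition; the item scores (five noisy indicators of one latent variable) are simulated for illustration.

{{{#!python
# Cronbach's alpha for a set of items (columns of a score matrix).
import numpy as np

rng = np.random.default_rng(6)
latent = rng.normal(size=(200, 1))                     # one common factor
items = latent + rng.normal(scale=1.0, size=(200, 5))  # 5 noisy items

k = items.shape[1]
sum_item_vars = items.var(axis=0, ddof=1).sum()
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)
print("Cronbach's alpha = %.2f" % alpha)
}}}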
What to do following an ANOVA
- Why do we use follow-up tests?
- Different ways to follow up an ANOVA
- Planned vs. Post Hoc Tests
- Choosing and Coding Contrasts
- Handling Interactions
- Standard Errors of Differences
- Multiple t-tests
- Post Hoc Tests
- Trend Analysis
- Unpacking interactions
- Multiple Comparisons: Watch your Error Rate!
- Post-Hoc vs A Priori Hypotheses
- Comparisons and Contrasts
- Family-wise (FW) error rate
- Experimentwise error rate
- Orthogonal Contrasts or Comparisons
- Planned Comparisons vs. Post Hoc Comparisons
- Orthogonal Contrasts/Comparisons
- Planned Comparisons or Contrasts
- Contrasts in GLM
- Post Hoc Tests
- Control of False Discovery Rate (FDR)
- Simple Main Effects
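
Finally, a sketch of correcting a family of p-values for multiple comparisons, covering Bonferroni, Holm and the Benjamini-Hochberg FDR procedure via statsmodels; the ten p-values are purely illustrative.

{{{#!python
# Multiple-comparison adjustment of a family of p-values.
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.021, 0.001, 0.017, 0.041, 0.005,
                  0.036, 0.042, 0.023, 0.07, 0.1])

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print("%-10s rejects %d of %d tests" % (method, reject.sum(), len(pvals)))
}}}

Note how FDR control typically rejects more hypotheses than familywise (Bonferroni/Holm) control, since it bounds the expected proportion of false rejections rather than the chance of any false rejection.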