NorasData - MRC CBU Language Wiki

Open R

Load all the packages that are needed
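For this walkthrough, two packages are needed: lme4, which provides lmer(), and lattice, which provides qqmath(). Load them with:

> library(lme4)      # provides lmer()
> library(lattice)   # provides qqmath()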

Import the data into R’s workspace with

n = read.csv("/Users/mcolthea/Desktop/RFolder/Nora/NoraData.csv")

Now the file called n contains the whole raw data.

Check that it looks OK by reading the first 10 rows of the file:

> head(n, n = 10)

Looks OK.

Now, we aren’t going to want to include errors when we analyse RTs. So the first thing we want to do is create a file that only contains the correct responses.

> nc = n[n$accuracy=="correct",]

This new file nc is the file that just has the correct responses.

What was the overall error rate?

> (nrow(n)-nrow(nc))/nrow(n)
[1] 0.0519774

5.2% of responses were errors.

One other thing is typically done before data analysis starts: eliminating any responses where the RT is suspiciously long or suspiciously short, i.e. “outliers”. This is often called “data-trimming”. Here you are eliminating noise from the data and so making subsequent statistical analyses more sensitive.
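In this walkthrough the trimming itself is done by hand in Excel, below. Purely as an illustrative sketch, trimming with fixed cutoffs could also be done directly in R; the 200 msec and 2000 msec cutoffs here are assumptions for the example, not the values used for this dataset.

> # Hypothetical absolute cutoffs -- illustrative only, not the values used below
> nc.trim = nc[nc$RT > 200 & nc$RT < 2000,]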

You can visualize the outliers in your data by inspecting the quantile-quantile plot for each subject. Here’s how you do this in R:

First, set things up so that you won’t have too many plots on each page; to get to the next page, you press Return

> devAskNewPage(ask=TRUE)

Now display the quantile-quantile plots

> qqmath(~RT|Subject.ID, data = nc, layout=c(3,3,4))

This says there are 3 rows and 3 columns per page, and 4 pages.

Nice clean data. Not many outliers. But there are a few obvious ones.

The best way to look at these is to sort each subject’s RTs from lowest to highest, and look at the sorted file. They can be sorted in R with this command

> ncs=nc[order(nc$Subject.ID, nc$RT),]

which sorts by Subject.ID and within each subject sorts by RT.

Trimming the data is more easily done in Excel, so export this sorted file:

> write.csv(ncs, "/Users/mcolthea/Desktop/RFolder/Nora/ncs.csv")

and then open it in Excel. Do the trimming in Excel. Call the trimmed file ncst. Import it to R.

> ncst = read.csv("/Users/mcolthea/Desktop/RFolder/Nora/ncst.csv")

What is the proportion of correct responses that have been discarded?

> (nrow(ncs)-nrow(ncst))/nrow(ncs)
[1] 0.01102503

1.1% of the data have been trimmed.

Look at the quantile-quantile plots again.

> devAskNewPage(ask=TRUE)
> qqmath(~RT|Subject.ID, data = ncst, layout=c(3,3,4))

The obvious outliers are gone. Now data analysis can begin using this trimmed data.

Analysing the data

What is the mean RT for each condition?

> tapply(ncst$RT, ncst$itemtype, mean)

   count     mass 
514.0556 523.8704

Is this mean difference of 9.8 msec significant?

> ncst.lmer = lmer(RT ~ itemtype + (1|Subject.ID) + (1|Item), data = ncst)

You can see the result of the analysis by typing:

> ncst.lmer

and if you do, you get

Linear mixed model fit by REML 
Formula: RT ~ itemtype + (1 | Subject.ID) + (1 | Item)
   Data: ncst 
   
  AIC   BIC  logLik   deviance REMLdev
 39988 40018 -19989    39990   39978

Random effects:
 Groups     Name        Variance Std.Dev.
 Subject.ID (Intercept) 3262.700 57.1201 
 itemtype   (Intercept)   14.890  3.8588 
 Residual               9682.982 98.4021 
Number of obs: 3319, groups: Subject.ID, 30; itemtype, 2

Fixed effects:
             Estimate Std. Error t value
(Intercept)   513.274     11.295   45.44
itemtypemass   10.128      6.297    1.61

Correlation of Fixed Effects:
            (Intr)
itemtypemss -0.265

What does this all mean?

  • The t value is 1.61, which wouldn’t be significant (by the common rule of thumb that |t| > 2 counts as significant). But even if it were, we haven’t taken into account the potential confounding factor of LOG_Surface.freq. So add that factor to the model:

> ncst.lmerA = lmer(RT ~ itemtype + (1|Subject.ID) + (1|Item) + LOG_Surface.freq, data = ncst)

And what about LOG_stem.freq?

> ncst.lmerB = lmer(RT ~ itemtype + (1|Subject.ID) + (1|Item) + LOG_Surface.freq + LOG_stem.freq, data = ncst)
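Neither model’s output is reproduced on this page, but you can inspect each one the same way as before, by typing its name and checking the t values of the fixed effects (again using the |t| > 2 rule of thumb):

> ncst.lmerA
> ncst.lmerB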

So it looks as if there is no mass/count effect and no surface frequency effect, but there is a stem frequency effect. Since we have a subset of the items that are matched on stem frequency, analysing that subset next would be a good move.
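That matched-subset analysis isn’t shown on this page. As a hypothetical sketch, if the data file had a column flagging the matched items (the column name matched and its yes/no coding are assumptions, not something in this dataset), it could look like this:

> # 'matched' is a hypothetical column flagging the stem-frequency-matched items
> ncst.m = ncst[ncst$matched == "yes",]
> ncst.m.lmer = lmer(RT ~ itemtype + (1|Subject.ID) + (1|Item), data = ncst.m)
> ncst.m.lmer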

Another thing we can do is build a set of different models incorporating different factors.

> ncst.lmerC = lmer(RT ~ (1|Subject.ID) + (1|Item) + LOG_stem.freq, data = ncst)

includes only the stem frequency factor.

> ncst.lmerD = lmer(RT ~ (1|Subject.ID) + (1|Item) + LOG_stem.freq + itemtype, data = ncst)

adds a second factor, item type. Does the second model fit better than the first?

> anova(ncst.lmerC, ncst.lmerD)

Data: ncst
Models:
ncst.lmerC: RT ~ (1 | Subject.ID) + (1 | Item) + LOG_stem.freq
ncst.lmerD: RT ~ (1 | Subject.ID) + (1 | Item) + LOG_stem.freq + itemtype
           Df   AIC   BIC logLik  Chisq Chi Df Pr(>Chisq)
ncst.lmerC  5 39802 39832 -19896
ncst.lmerD  6 39804 39840 -19896 0.3083      1     0.5787

No. The non-significant chi-squared shows that a model that includes both item type and stem frequency does not produce a better fit than a model that includes only stem frequency.

Here’s a model that has surface frequency and item type:

> ncst.lmerG = lmer(RT ~ (1|Subject.ID) + (1|Item) + LOG_Surface.freq + itemtype, data = ncst)

If we add stem frequency to this model

> ncst.lmerF = lmer(RT ~ (1|Subject.ID) + (1|Item) + LOG_stem.freq + LOG_Surface.freq + itemtype, data = ncst)

Do we get a better fit? Does model F perform better than model G?

> anova(ncst.lmerG, ncst.lmerF)
Data: ncst
Models:
ncst.lmerG: RT ~ (1 | Subject.ID) + (1 | Item) + LOG_Surface.freq + itemtype
ncst.lmerF: RT ~ (1 | Subject.ID) + (1 | Item) + LOG_stem.freq + LOG_Surface.freq +
ncst.lmerF:     itemtype
           Df   AIC   BIC logLik  Chisq Chi Df Pr(>Chisq)
ncst.lmerG  6 39817 39854 -19903
ncst.lmerF  7 39805 39848 -19896 14.448      1  0.0001441 ***

Yes. F is a significant improvement over G. Note how AIC and BIC are lower for F (and the log-likelihood is higher). That means the fit is closer.

So adding stem frequency gives us a better model, but adding item type or surface frequency or both does not improve the fit of the model to the data.