Eye-Tracking Analysis Workshop

Dan Mirman

23 January 2019

Workshop Schedule

Day 1: Data wrangling, Pre-processing eye-tracking data

Time              Content
9am - Noon        Part 1: Data wrangling
Noon - 1:30pm     Lunch break, Q&A, own data analysis time
1:30pm - 4:30pm   Part 2: Pre-processing eye-tracking data

Day 2: Statistical analysis of eye-tracking data (GCA)

Introduction to tidyverse

tidyverse homepage https://www.tidyverse.org/

Introduction to tidyverse

tidyverse covers the major aspects of data wrangling in one integrated suite of packages

Pipes (magrittr)

Easy way to do a series of operations:

g(f(x, y), z) is the same as x %>% f(y) %>% g(z)

But easier to read and write: “Take x, then do f(y), then do g(z).”
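A minimal sketch of the equivalence (the %>% operator is attached when you load the tidyverse, as on the next slide; the data vector here is made up):

x <- c(1, 4, 9, 16)
sqrt(sum(x))            # nested form: 5.477226
x %>% sum() %>% sqrt()  # piped form, same result: 5.477226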

Load the packages

library(tidyverse) # a shortcut that loads the core tidyverse packages
## -- Attaching packages ----------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.2.5
## v tibble  2.0.1     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## -- Conflicts -------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gazer) # example data and functions for analyzing eye-tracking data

Reading data

Read a SPSS file

dat <- haven::read_sav("filename.sav")

Read an Excel file

dat <- readxl::read_excel("filename.xls", 
                          sheet = "Sheet1", # optionally specify a sheet by name or number
                          skip = 2, # optionally skip some header rows
                          range = "C1:E7") # optionally specify a cell range

Note: haven and readxl are installed with the tidyverse but are not “core” packages, so library(tidyverse) does not attach them; load them with library() or call their functions via ::, as above.
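By contrast, readr (for plain-text files such as CSV) is a core package and is attached by library(tidyverse); the filename here is a placeholder:

dat <- read_csv("filename.csv") # comma-separated values; see also read_tsv() and read_delim()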

Manipulating data

summary(NamingRecovery)
##    SubjectID       Diagnosis     TestTime    Correct      
##  NR001  : 5   Anomic    :30   Min.   :0   Min.   :0.0227  
##  NR002  : 5   Conduction:45   1st Qu.:1   1st Qu.:0.4136  
##  NR003  : 5   Wernicke  :40   Median :2   Median :0.7278  
##  NR004  : 5                   Mean   :2   Mean   :0.6310  
##  NR005  : 5                   3rd Qu.:3   3rd Qu.:0.8850  
##  NR006  : 5                   Max.   :4   Max.   :0.9942  
##  (Other):85                                               
##  Semantic.error    Mixed.error       Formal.error    Unrelated.error 
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0146   1st Qu.:0.00695   1st Qu.:0.0174   1st Qu.:0.0000  
##  Median :0.0235   Median :0.01790   Median :0.0578   Median :0.0060  
##  Mean   :0.0354   Mean   :0.02215   Mean   :0.0881   Mean   :0.0606  
##  3rd Qu.:0.0492   3rd Qu.:0.02940   3rd Qu.:0.1383   3rd Qu.:0.0656  
##  Max.   :0.2000   Max.   :0.08780   Max.   :0.3778   Max.   :0.3540  
##                                                                      
##  Nonword.error   
##  Min.   :0.0000  
##  1st Qu.:0.0495  
##  Median :0.1279  
##  Mean   :0.1627  
##  3rd Qu.:0.2247  
##  Max.   :0.6518  
## 
ggplot(NamingRecovery, aes(TestTime, Correct, color=Diagnosis)) + 
  stat_summary(fun.y=mean, geom="line") +
  stat_summary(fun.y=mean, geom="point")

Manipulating data

For each of the five test times, calculate the overall average accuracy.

NamingRecovery %>% 
  group_by(TestTime) %>% 
  summarize(Accuracy = mean(Correct))
## # A tibble: 5 x 2
##   TestTime Accuracy
##      <dbl>    <dbl>
## 1        0    0.484
## 2        1    0.628
## 3        2    0.653
## 4        3    0.682
## 5        4    0.708

Manipulating data

Calculate the overall accuracy for each aphasia sub-type group.

NamingRecovery %>% 
  group_by(Diagnosis) %>% 
  summarize(Accuracy = mean(Correct))
## # A tibble: 3 x 2
##   Diagnosis  Accuracy
##   <fct>         <dbl>
## 1 Anomic        0.841
## 2 Conduction    0.670
## 3 Wernicke      0.429

Manipulating data

Calculate the average accuracy for each test time separately for each aphasia sub-type group

NamingRecovery %>% 
  group_by(Diagnosis, TestTime) %>% 
  summarize(Accuracy = mean(Correct))
## # A tibble: 15 x 3
## # Groups:   Diagnosis [?]
##    Diagnosis  TestTime Accuracy
##    <fct>         <dbl>    <dbl>
##  1 Anomic            0    0.661
##  2 Anomic            1    0.859
##  3 Anomic            2    0.880
##  4 Anomic            3    0.893
##  5 Anomic            4    0.914
##  6 Conduction        0    0.526
##  7 Conduction        1    0.650
##  8 Conduction        2    0.698
##  9 Conduction        3    0.728
## 10 Conduction        4    0.749
## 11 Wernicke          0    0.304
## 12 Wernicke          1    0.431
## 13 Wernicke          2    0.433
## 14 Wernicke          3    0.471
## 15 Wernicke          4    0.507

Manipulating data

Calculate the average accuracy for each test time separately for each aphasia sub-type group, and make a nice table.

NamingRecovery %>% 
  group_by(Diagnosis, TestTime) %>% 
  summarize(Accuracy = mean(Correct)) %>%
  spread(key = Diagnosis, value = Accuracy)
## # A tibble: 5 x 4
##   TestTime Anomic Conduction Wernicke
##      <dbl>  <dbl>      <dbl>    <dbl>
## 1        0  0.661      0.526    0.304
## 2        1  0.859      0.650    0.431
## 3        2  0.880      0.698    0.433
## 4        3  0.893      0.728    0.471
## 5        4  0.914      0.749    0.507

Organizing data: spread()

https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf

Organizing data: tidyr

Tidy data: Every column is a variable, every row is an observation

head(NamingRecovery)
##   SubjectID  Diagnosis TestTime Correct Semantic.error Mixed.error
## 1     NR001 Conduction        0  0.8313         0.0688      0.0250
## 3     NR002 Conduction        0  0.7679         0.0357      0.0357
## 4     NR003 Conduction        0  0.0368         0.0441      0.0294
## 5     NR004   Wernicke        0  0.1357         0.0643      0.0429
## 6     NR005 Conduction        0  0.8526         0.0385      0.0256
## 7     NR006     Anomic        0  0.8158         0.0614      0.0614
##   Formal.error Unrelated.error Nonword.error
## 1       0.0125          0.0063        0.0563
## 3       0.0238          0.0060        0.1310
## 4       0.3382          0.3529        0.1985
## 5       0.1643          0.2714        0.3214
## 6       0.0128          0.0064        0.0641
## 7       0.0351          0.0175        0.0088

Organizing data: gather()

https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf

Organizing data: gather()

NR.g <- gather(NamingRecovery, 
               key="ResponseType", value="Proportion",
               "Correct", "Semantic.error", "Mixed.error", # instead could use 4:9
               "Formal.error", "Unrelated.error", "Nonword.error")
head(NR.g)
##   SubjectID  Diagnosis TestTime ResponseType Proportion
## 1     NR001 Conduction        0      Correct     0.8313
## 2     NR002 Conduction        0      Correct     0.7679
## 3     NR003 Conduction        0      Correct     0.0368
## 4     NR004   Wernicke        0      Correct     0.1357
## 5     NR005 Conduction        0      Correct     0.8526
## 6     NR006     Anomic        0      Correct     0.8158

Organizing data: gather()

ggplot(NR.g, aes(TestTime, Proportion, color=Diagnosis)) +
  facet_wrap(~ ResponseType, scales = "free_y") +
  stat_summary(fun.y=mean, geom="line") +
  stat_summary(fun.y=mean, geom="point")

Exercise

The USArrests data set contains violent crime arrests (per 100,000 residents) in each of the 50 states in the USA in 1973 and the percent of the population of each state that lived in urban areas.

summary(USArrests)
##      Murder         Assault       UrbanPop         Rape     
##  Min.   : 0.80   Min.   : 45   Min.   :32.0   Min.   : 7.3  
##  1st Qu.: 4.08   1st Qu.:109   1st Qu.:54.5   1st Qu.:15.1  
##  Median : 7.25   Median :159   Median :66.0   Median :20.1  
##  Mean   : 7.79   Mean   :171   Mean   :65.5   Mean   :21.2  
##  3rd Qu.:11.25   3rd Qu.:249   3rd Qu.:77.8   3rd Qu.:26.2  
##  Max.   :17.40   Max.   :337   Max.   :91.0   Max.   :46.0

A. Convert the USArrests data set from a wide to a long format so that instead of separate variables for each crime type (Murder, Assault, Rape), there is one variable that identifies the crime type and one variable that contains the rates for each crime type for each state.

B. Make a scatterplot showing the relationship between each type of violent crime rate and percent of population living in urban areas.

Exercise Solution

# A. Convert the USArrests data set from a wide to a long format
x <- gather(USArrests, key="CrimeType", value="Rate", 
            Murder, Assault, Rape)
# B. Make a scatterplot showing each type of violent crime rate and percent urban population
ggplot(x, aes(UrbanPop, Rate)) + 
  facet_wrap(~ CrimeType, scales="free", nrow=1) + 
  geom_point() + stat_smooth(method="lm")

Subsetting data

Example: from NamingRecovery get just the baseline accuracy and formal errors for participants with Conduction aphasia.

# base
NR.s <- subset(NamingRecovery, Diagnosis == "Conduction" & TestTime == 0, 
               select = c("SubjectID", "Diagnosis", "TestTime", "Correct", "Formal.error"))
NR.s
##    SubjectID  Diagnosis TestTime Correct Formal.error
## 1      NR001 Conduction        0  0.8313       0.0125
## 3      NR002 Conduction        0  0.7679       0.0238
## 4      NR003 Conduction        0  0.0368       0.3382
## 6      NR005 Conduction        0  0.8526       0.0128
## 10     NR009 Conduction        0  0.1386       0.1627
## 15    NR0013 Conduction        0  0.6023       0.1345
## 82    NR0019 Conduction        0  0.3050       0.1915
## 85    NR0022 Conduction        0  0.7627       0.0339
## 86    NR0023 Conduction        0  0.4348       0.1739
# dplyr
NR.s <- NamingRecovery %>% 
  filter(Diagnosis == "Conduction" & TestTime == 0) %>%
  select(SubjectID, Diagnosis, TestTime, Correct, Formal.error)
NR.s
##   SubjectID  Diagnosis TestTime Correct Formal.error
## 1     NR001 Conduction        0  0.8313       0.0125
## 2     NR002 Conduction        0  0.7679       0.0238
## 3     NR003 Conduction        0  0.0368       0.3382
## 4     NR005 Conduction        0  0.8526       0.0128
## 5     NR009 Conduction        0  0.1386       0.1627
## 6    NR0013 Conduction        0  0.6023       0.1345
## 7    NR0019 Conduction        0  0.3050       0.1915
## 8    NR0022 Conduction        0  0.7627       0.0339
## 9    NR0023 Conduction        0  0.4348       0.1739

Separate one column into multiple columns

Useful when you have values that are actually multiple distinct pieces of information stuck together.

dat <- data.frame(Condition = paste(rep(c("A", "B"), 3), rep(1:3, each=2), sep="_"))
dat
##   Condition
## 1       A_1
## 2       B_1
## 3       A_2
## 4       B_2
## 5       A_3
## 6       B_3
# same call written with a pipe:
# dat %>% separate(Condition, into=c("Factor1", "Factor2"), sep="_")
separate(dat, Condition, into=c("Factor1", "Factor2"), sep="_")
##   Factor1 Factor2
## 1       A       1
## 2       B       1
## 3       A       2
## 4       B       2
## 5       A       3
## 6       B       3
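The inverse operation is tidyr::unite(), which pastes multiple columns back into one (dat2 is just a name for the separated result):

dat2 <- separate(dat, Condition, into=c("Factor1", "Factor2"), sep="_")
unite(dat2, "Condition", Factor1, Factor2, sep="_") # reconstructs the original single column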

Data transformations: dplyr::mutate()

mutate() is similar to summarize(), but returns a transformation of the data (same number of rows) rather than a summary (one row per group).
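A quick contrast on the same grouping (both lines use the NamingRecovery data from above; MeanAcc is just an illustrative column name):

NamingRecovery %>% group_by(SubjectID) %>% summarize(MeanAcc = mean(Correct)) # one row per subject
NamingRecovery %>% group_by(SubjectID) %>% mutate(MeanAcc = mean(Correct))    # all rows kept; MeanAcc repeated within subject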

Example: Express each participant’s accuracy as a proportion of their baseline (TestTime 0) accuracy.

ex <- NamingRecovery %>%
  group_by(SubjectID) %>%
  mutate(AccChange = Correct/Correct[TestTime == 0])
#summary(ex)
ggplot(ex, aes(TestTime, AccChange)) + 
  stat_summary(fun.y=mean, geom="line") +
  stat_summary(fun.data=mean_se, geom="pointrange")

Writing functions

DRY: Don’t Repeat Yourself. If you find yourself copy-pasting the same code with small changes, wrap it in a function instead.

Function arguments

fun_name <- function(arg1, arg2, ..., arg_def=0){...}

addTwo <- function(x){
  result <- x+2
  return(result)
}
addTwo(10)
## [1] 12
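The arg_def=0 in the template above is a default value: callers can omit that argument. A minimal sketch of how defaults behave (power_fun is a made-up name):

power_fun <- function(x, p=2){ # p defaults to 2 when not supplied
  return(x^p)
}
power_fun(3)    # uses the default: 9
power_fun(3, 3) # overrides it: 27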

Exercise: Fibonacci function

Fibonacci sequence: \(F_n = F_{n-1} + F_{n-2}\)

\(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\)

Can be calculated using a for loop:

fib <- c(1,1) 
for(n in 3:100){
  fib[n] <- fib[n-1] + fib[n-2]
}
fib[1:15]
##  [1]   1   1   2   3   5   8  13  21  34  55  89 144 233 377 610

Write a function that takes one argument (x) and produces the first x numbers of the Fibonacci sequence.

Exercise Solution

Write a function that takes one argument (x) and produces the first x numbers of the Fibonacci sequence: \(F_n = F_{n-1} + F_{n-2}\).

fib_fun <- function(x){
  fib <- c(1,1) 
  if(x >= 3){ # guard: 3:x would count down if x < 3
    for(i in 3:x){
      fib[i] <- fib[i-1] + fib[i-2]
    }
  }
  return(fib[1:x]) # return exactly the first x numbers
}
fib_fun(15)
##  [1]   1   1   2   3   5   8  13  21  34  55  89 144 233 377 610

More about tidyverse and R programming

R for Data Science

Read it online for free: https://r4ds.had.co.nz/

Check out RStudio Cheat Sheets: https://www.rstudio.com/resources/cheatsheets/