Dan Mirman
23 January 2019
Day 1: Data wrangling, Pre-processing eye-tracking data
Time | Content |
---|---|
9am - Noon | Part 1: Data wrangling |
Noon - 1:30pm | Lunch break, Q&A, own data analysis time |
1:30pm - 4:30pm | Part 2: Pre-processing eye-tracking data |
Day 2: Statistical analysis of eye-tracking data (GCA)
tidyverse

The tidyverse covers the major aspects of data wrangling in one integrated suite of packages:

- Reading data: readr, readxl (.xls and .xlsx files), haven (SPSS and SAS files)
- Plotting: ggplot2
- Reshaping data (tidyr): gather() columns into key-value pairs, spread() unique values into columns, separate() cells in a column into multiple columns
- Manipulating data (dplyr): select() columns, filter() cases, arrange() rows in order, mutate() variables, summarize() values
- Pipes (%>%, from magrittr): an easy way to do a series of operations.
g(f(x, y), z) is the same as x %>% f(y) %>% g(z), but the piped version is easier to read and write: “Take x, then do f(y), then do g(z).”
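A tiny made-up example (the numbers are arbitrary; the %>% pipe becomes available once the tidyverse is loaded, as below):

round(mean(c(1, 3, 6)), 1)          # nested: must be read inside-out
c(1, 3, 6) %>% mean() %>% round(1)  # piped: same result (3.3), reads left to right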
library(tidyverse) # a shortcut that loads the core tidyverse packages
## -- Attaching packages ----------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 2.0.1 v dplyr 0.7.8
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.3.1 v forcats 0.3.0
## -- Conflicts -------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gazer) # example data and functions for analyzing eye-tracking data
Read an SPSS file
dat <- haven::read_sav("filename.sav")
Read an Excel file
dat <- readxl::read_excel("filename.xls",
sheet = "Sheet1", # optionally specify a sheet by name or number
skip = 2, # optionally skip some header rows
range ="C1:E7") # optionally specify a cell range
Note: haven and readxl are installed as part of the tidyverse, but they are not “core” packages, so library(tidyverse) does not attach them; they need to be loaded separately (or called with the :: prefix, as above).
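For example, to attach them for a whole session instead of using the :: prefix:

library(haven)   # read_sav() for SPSS files, read_sas() for SAS files
library(readxl)  # read_excel() for .xls and .xlsx files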
summary(NamingRecovery)
## SubjectID Diagnosis TestTime Correct
## NR001 : 5 Anomic :30 Min. :0 Min. :0.0227
## NR002 : 5 Conduction:45 1st Qu.:1 1st Qu.:0.4136
## NR003 : 5 Wernicke :40 Median :2 Median :0.7278
## NR004 : 5 Mean :2 Mean :0.6310
## NR005 : 5 3rd Qu.:3 3rd Qu.:0.8850
## NR006 : 5 Max. :4 Max. :0.9942
## (Other):85
## Semantic.error Mixed.error Formal.error Unrelated.error
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0146 1st Qu.:0.00695 1st Qu.:0.0174 1st Qu.:0.0000
## Median :0.0235 Median :0.01790 Median :0.0578 Median :0.0060
## Mean :0.0354 Mean :0.02215 Mean :0.0881 Mean :0.0606
## 3rd Qu.:0.0492 3rd Qu.:0.02940 3rd Qu.:0.1383 3rd Qu.:0.0656
## Max. :0.2000 Max. :0.08780 Max. :0.3778 Max. :0.3540
##
## Nonword.error
## Min. :0.0000
## 1st Qu.:0.0495
## Median :0.1279
## Mean :0.1627
## 3rd Qu.:0.2247
## Max. :0.6518
##
ggplot(NamingRecovery, aes(TestTime, Correct, color=Diagnosis)) +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.y=mean, geom="point")
For each of the five test times, calculate the overall average accuracy.
NamingRecovery %>%
group_by(TestTime) %>%
summarize(Accuracy = mean(Correct))
## # A tibble: 5 x 2
## TestTime Accuracy
## <dbl> <dbl>
## 1 0 0.484
## 2 1 0.628
## 3 2 0.653
## 4 3 0.682
## 5 4 0.708
Calculate the overall accuracy for each aphasia sub-type group.
NamingRecovery %>%
group_by(Diagnosis) %>%
summarize(Accuracy = mean(Correct))
## # A tibble: 3 x 2
## Diagnosis Accuracy
## <fct> <dbl>
## 1 Anomic 0.841
## 2 Conduction 0.670
## 3 Wernicke 0.429
Calculate the average accuracy for each test time separately for each aphasia sub-type group.
NamingRecovery %>%
group_by(Diagnosis, TestTime) %>%
summarize(Accuracy = mean(Correct))
## # A tibble: 15 x 3
## # Groups: Diagnosis [?]
## Diagnosis TestTime Accuracy
## <fct> <dbl> <dbl>
## 1 Anomic 0 0.661
## 2 Anomic 1 0.859
## 3 Anomic 2 0.880
## 4 Anomic 3 0.893
## 5 Anomic 4 0.914
## 6 Conduction 0 0.526
## 7 Conduction 1 0.650
## 8 Conduction 2 0.698
## 9 Conduction 3 0.728
## 10 Conduction 4 0.749
## 11 Wernicke 0 0.304
## 12 Wernicke 1 0.431
## 13 Wernicke 2 0.433
## 14 Wernicke 3 0.471
## 15 Wernicke 4 0.507
Calculate the average accuracy for each test time separately for each aphasia sub-type group, and make a nice table.
NamingRecovery %>%
group_by(Diagnosis, TestTime) %>%
summarize(Accuracy = mean(Correct)) %>%
spread(key = Diagnosis, value = Accuracy)
## # A tibble: 5 x 4
## TestTime Anomic Conduction Wernicke
## <dbl> <dbl> <dbl> <dbl>
## 1 0 0.661 0.526 0.304
## 2 1 0.859 0.650 0.431
## 3 2 0.880 0.698 0.433
## 4 3 0.893 0.728 0.471
## 5 4 0.914 0.749 0.507
tidyr
Tidy data: Every column is a variable, every row is an observation
head(NamingRecovery)
## SubjectID Diagnosis TestTime Correct Semantic.error Mixed.error
## 1 NR001 Conduction 0 0.8313 0.0688 0.0250
## 3 NR002 Conduction 0 0.7679 0.0357 0.0357
## 4 NR003 Conduction 0 0.0368 0.0441 0.0294
## 5 NR004 Wernicke 0 0.1357 0.0643 0.0429
## 6 NR005 Conduction 0 0.8526 0.0385 0.0256
## 7 NR006 Anomic 0 0.8158 0.0614 0.0614
## Formal.error Unrelated.error Nonword.error
## 1 0.0125 0.0063 0.0563
## 3 0.0238 0.0060 0.1310
## 4 0.3382 0.3529 0.1985
## 5 0.1643 0.2714 0.3214
## 6 0.0128 0.0064 0.0641
## 7 0.0351 0.0175 0.0088
Each row contains the data for participant i at test time j. Are these data “tidy”? Not quite: there are several response-type variables (Correct, Semantic.error, Mixed.error, etc.), and each of those is a separate observation. gather() reshapes those columns into key-value pairs:
NR.g <- gather(NamingRecovery,
key="ResponseType", value="Proportion",
"Correct", "Semantic.error", "Mixed.error", # instead could use 4:9
"Formal.error", "Unrelated.error", "Nonword.error")
head(NR.g)
## SubjectID Diagnosis TestTime ResponseType Proportion
## 1 NR001 Conduction 0 Correct 0.8313
## 2 NR002 Conduction 0 Correct 0.7679
## 3 NR003 Conduction 0 Correct 0.0368
## 4 NR004 Wernicke 0 Correct 0.1357
## 5 NR005 Conduction 0 Correct 0.8526
## 6 NR006 Anomic 0 Correct 0.8158
After gather(), the long-format data make it easy to plot each response type in its own facet:
ggplot(NR.g, aes(TestTime, Proportion, color=Diagnosis)) +
facet_wrap(~ ResponseType, scales = "free_y") +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.y=mean, geom="point")
The USArrests
data set contains violent crime arrests (per 100,000 residents) in each of the 50 states in the USA in 1973 and the percent of the population of each state that lived in urban areas.
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.80 Min. : 45 Min. :32.0 Min. : 7.3
## 1st Qu.: 4.08 1st Qu.:109 1st Qu.:54.5 1st Qu.:15.1
## Median : 7.25 Median :159 Median :66.0 Median :20.1
## Mean : 7.79 Mean :171 Mean :65.5 Mean :21.2
## 3rd Qu.:11.25 3rd Qu.:249 3rd Qu.:77.8 3rd Qu.:26.2
## Max. :17.40 Max. :337 Max. :91.0 Max. :46.0
A. Convert the USArrests
data set from a wide to a long format so that instead of separate variables for each crime type (Murder, Assault, Rape), there is one variable that identifies the crime type and one variable that contains the rates for each crime type for each state.
B. Make a scatterplot showing the relationship between each type of violent crime rate and percent of population living in urban areas.
# A. Convert the USArrests data set from a wide to a long format
x <- gather(USArrests, key="CrimeType", value="Rate",
Murder, Assault, Rape)
# B. Make a scatterplot showing each type of violent crime rate and percent urban population
ggplot(x, aes(UrbanPop, Rate)) +
facet_wrap(~ CrimeType, scales="free", nrow=1) +
geom_point() + stat_smooth(method="lm")
base::merge() and the dplyr *_join() functions do the same things. The *_join() functions are supposed to be faster and don’t mess with the order of the rows. (But I’m used to the syntax of merge(), so that’s what I use.)
By default, both merge() and *_join() will use columns with identical names as the “key” variables for aligning data elements. If the variables do not have identical names (e.g., Participant vs. ParticipantID), use the by argument to specify the matching variables.
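A minimal sketch (subject_info and trial_data are made-up data frames for illustration, not part of the workshop data):

# two hypothetical tables whose key columns have different names
subject_info <- data.frame(ParticipantID = c("S1", "S2"),
                           Group = c("Control", "Patient"),
                           stringsAsFactors = FALSE)
trial_data <- data.frame(Participant = c("S1", "S1", "S2"),
                         RT = c(512, 480, 630),
                         stringsAsFactors = FALSE)

# base R: name the key columns with by.x and by.y
merge(trial_data, subject_info, by.x = "Participant", by.y = "ParticipantID")

# dplyr: name the key columns with a named vector
left_join(trial_data, subject_info, by = c("Participant" = "ParticipantID"))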
base::subset() does the same job as dplyr::filter() plus dplyr::select(). Example: from NamingRecovery, get just the baseline accuracy and formal errors for participants with Conduction aphasia.
# base
NR.s <- subset(NamingRecovery, Diagnosis == "Conduction" & TestTime == 0,
select = c("SubjectID", "Diagnosis", "TestTime", "Correct", "Formal.error"))
NR.s
## SubjectID Diagnosis TestTime Correct Formal.error
## 1 NR001 Conduction 0 0.8313 0.0125
## 3 NR002 Conduction 0 0.7679 0.0238
## 4 NR003 Conduction 0 0.0368 0.3382
## 6 NR005 Conduction 0 0.8526 0.0128
## 10 NR009 Conduction 0 0.1386 0.1627
## 15 NR0013 Conduction 0 0.6023 0.1345
## 82 NR0019 Conduction 0 0.3050 0.1915
## 85 NR0022 Conduction 0 0.7627 0.0339
## 86 NR0023 Conduction 0 0.4348 0.1739
# dplyr
NR.s <- NamingRecovery %>%
filter(Diagnosis == "Conduction" & TestTime == 0) %>%
select(SubjectID, Diagnosis, TestTime, Correct, Formal.error)
NR.s
## SubjectID Diagnosis TestTime Correct Formal.error
## 1 NR001 Conduction 0 0.8313 0.0125
## 2 NR002 Conduction 0 0.7679 0.0238
## 3 NR003 Conduction 0 0.0368 0.3382
## 4 NR005 Conduction 0 0.8526 0.0128
## 5 NR009 Conduction 0 0.1386 0.1627
## 6 NR0013 Conduction 0 0.6023 0.1345
## 7 NR0019 Conduction 0 0.3050 0.1915
## 8 NR0022 Conduction 0 0.7627 0.0339
## 9 NR0023 Conduction 0 0.4348 0.1739
tidyr::separate()

separate() is useful when you have values that are actually multiple distinct pieces of information stuck together.
dat <- data.frame(Condition = paste(rep(c("A", "B"), 3), rep(1:3, each=2), sep="_"))
dat
## Condition
## 1 A_1
## 2 B_1
## 3 A_2
## 4 B_2
## 5 A_3
## 6 B_3
# equivalent pipe syntax: dat %>% separate(Condition, into=c("Factor1", "Factor2"), sep="_")
separate(dat, Condition, into=c("Factor1", "Factor2"), sep="_")
## Factor1 Factor2
## 1 A 1
## 2 B 1
## 3 A 2
## 4 B 2
## 5 A 3
## 6 B 3
dplyr::mutate()

mutate() is similar to summarize(), but returns a transformation of the data (same number of rows) rather than a summary (one value per group).
Example: Convert each participant’s accuracy to proportion change from their baseline accuracy.
ex <- NamingRecovery %>%
group_by(SubjectID) %>%
mutate(AccChange = Correct/Correct[TestTime == 0])
#summary(ex)
ggplot(ex, aes(TestTime, AccChange)) +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.data=mean_se, geom="pointrange")
DRY: Don’t Repeat Yourself

Some additional general principles:

- Use the built-in help: ?function_name
- Give functions descriptive names, such as action_argument or target_action
- Use print statements for debugging and monitoring

The basic template is fun_name <- function(arg1, arg2, ..., arg_def=0){...}; use return to return the result:
the resultaddTwo <- function(x){
result <- x+2
return(result)
}
addTwo(10)
## [1] 12
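The arg_def=0 in the template above is a default argument value. A quick sketch (addN is a made-up name):

addN <- function(x, n = 2){
  # n has a default, so callers can omit it
  return(x + n)
}
addN(10)     # uses the default n = 2; returns 12
addN(10, 5)  # overrides the default; returns 15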
Fibonacci sequence: \(F_n = F_{n-1} + F_{n-2}\)
\(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\)
Can be calculated using a for loop:
fib <- c(1,1)
for(n in 3:100){
fib[n] <- fib[n-1] + fib[n-2]
}
fib[1:15]
## [1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610
Write a function that takes one argument (x) and produces the first x numbers of the Fibonacci sequence: \(F_n = F_{n-1} + F_{n-2}\).
fib_fun <- function(x){
fib <- c(1,1)
for(i in 3:x){
fib[i] <- fib[i-1] + fib[i-2]
}
return(fib)
}
fib_fun(15)
## [1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610
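One caveat: as written, fib_fun() assumes x >= 3, because for smaller x the sequence 3:x counts down (3:1 is 3, 2, 1) and the loop fails. A defensive variant (fib_fun2 is a made-up name):

fib_fun2 <- function(x){
  fib <- c(1, 1)
  if (x > 2) {  # guard: skip the loop when 3:x would count down
    for(i in 3:x){
      fib[i] <- fib[i-1] + fib[i-2]
    }
  }
  return(fib[1:x])
}
fib_fun2(2)  # returns c(1, 1) instead of an error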
R for Data Science covers the tidyverse and R programming. Read it online for free: https://r4ds.had.co.nz/
Check out RStudio Cheat Sheets: https://www.rstudio.com/resources/cheatsheets/