Dan Mirman
23 January 2019
Day 1: Data wrangling, Pre-processing eye-tracking data
Time | Content |
---|---|
9am - Noon | Part 1: Data wrangling |
Noon - 1:30pm | Lunch break, Q&A, own data analysis time |
1:30pm - 4:30pm | Part 2: Pre-processing eye-tracking data |
Day 2: Statistical analysis of eye-tracking data (GCA)
tidyverse

The tidyverse covers the major aspects of data wrangling in one integrated suite of packages:

- Reading data: readr, readxl (.xls and .xlsx files), haven (SPSS and SAS files)
- Plotting: ggplot2
- Reshaping data (tidyr): gather() columns into key-value pairs, spread() unique values into columns, separate() cells in a column into multiple columns
- Manipulating data (dplyr): select() columns, filter() cases, arrange() rows in order, mutate() variables, summarize() values
- Pipes (%>%, from magrittr): an easy way to do a series of operations.
g(f(x, y), z) is the same as x %>% f(y) %>% g(z), but the piped version is easier to read and write: “Take x, then do f(y), then do g(z).”
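A tiny made-up example (the numbers are arbitrary; the %>% pipe becomes available once the tidyverse is loaded, as below):

round(mean(c(1, 3, 6)), 1)          # nested: must be read inside-out
c(1, 3, 6) %>% mean() %>% round(1)  # piped: same result (3.3), reads left to right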
library(tidyverse) # a shortcut that loads the core tidyverse packages
## -- Attaching packages ----------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.2.5
## v tibble 2.0.1 v dplyr 0.7.8
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.3.1 v forcats 0.3.0
## -- Conflicts -------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gazer) # example data and functions for analyzing eye-tracking data
Read an SPSS file
dat <- haven::read_sav("filename.sav")
Read an Excel file
dat <- readxl::read_excel("filename.xls",
sheet = "Sheet1", # optionally specify a sheet by name or number
skip = 2, # optionally skip some header rows
range ="C1:E7") # optionally specify a cell range
Note: haven and readxl are installed as part of the tidyverse, but they are not “core” packages, so library(tidyverse) does not attach them; they need to be loaded separately (or called with the :: prefix, as above).
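For example, to attach them for a whole session instead of using the :: prefix:

library(haven)   # read_sav() for SPSS files, read_sas() for SAS files
library(readxl)  # read_excel() for .xls and .xlsx files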
summary(NamingRecovery)
## SubjectID Diagnosis TestTime Correct
## NR001 : 5 Anomic :30 Min. :0 Min. :0.0227
## NR002 : 5 Conduction:45 1st Qu.:1 1st Qu.:0.4136
## NR003 : 5 Wernicke :40 Median :2 Median :0.7278
## NR004 : 5 Mean :2 Mean :0.6310
## NR005 : 5 3rd Qu.:3 3rd Qu.:0.8850
## NR006 : 5 Max. :4 Max. :0.9942
## (Other):85
## Semantic.error Mixed.error Formal.error Unrelated.error
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0146 1st Qu.:0.00695 1st Qu.:0.0174 1st Qu.:0.0000
## Median :0.0235 Median :0.01790 Median :0.0578 Median :0.0060
## Mean :0.0354 Mean :0.02215 Mean :0.0881 Mean :0.0606
## 3rd Qu.:0.0492 3rd Qu.:0.02940 3rd Qu.:0.1383 3rd Qu.:0.0656
## Max. :0.2000 Max. :0.08780 Max. :0.3778 Max. :0.3540
##
## Nonword.error
## Min. :0.0000
## 1st Qu.:0.0495
## Median :0.1279
## Mean :0.1627
## 3rd Qu.:0.2247
## Max. :0.6518
##
ggplot(NamingRecovery, aes(TestTime, Correct, color=Diagnosis)) +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.y=mean, geom="point")
For each of the five test times, calculate the overall average accuracy.
NamingRecovery %>%
group_by(TestTime) %>%
summarize(Accuracy = mean(Correct))
## # A tibble: 5 x 2
## TestTime Accuracy
## <dbl> <dbl>
## 1 0 0.484
## 2 1 0.628
## 3 2 0.653
## 4 3 0.682
## 5 4 0.708
Calculate the overall accuracy for each aphasia sub-type group.
NamingRecovery %>%
group_by(Diagnosis) %>%
summarize(Accuracy = mean(Correct))
## # A tibble: 3 x 2
## Diagnosis Accuracy
## <fct> <dbl>
## 1 Anomic 0.841
## 2 Conduction 0.670
## 3 Wernicke 0.429
Calculate the average accuracy for each test time separately for each aphasia sub-type group.
NamingRecovery %>%
group_by(Diagnosis, TestTime) %>%
summarize(Accuracy = mean(Correct))
## # A tibble: 15 x 3
## # Groups: Diagnosis [?]
## Diagnosis TestTime Accuracy
## <fct> <dbl> <dbl>
## 1 Anomic 0 0.661
## 2 Anomic 1 0.859
## 3 Anomic 2 0.880
## 4 Anomic 3 0.893
## 5 Anomic 4 0.914
## 6 Conduction 0 0.526
## 7 Conduction 1 0.650
## 8 Conduction 2 0.698
## 9 Conduction 3 0.728
## 10 Conduction 4 0.749
## 11 Wernicke 0 0.304
## 12 Wernicke 1 0.431
## 13 Wernicke 2 0.433
## 14 Wernicke 3 0.471
## 15 Wernicke 4 0.507
Calculate the average accuracy for each test time separately for each aphasia sub-type group, and make a nice table.
NamingRecovery %>%
group_by(Diagnosis, TestTime) %>%
summarize(Accuracy = mean(Correct)) %>%
spread(key = Diagnosis, value = Accuracy)
## # A tibble: 5 x 4
## TestTime Anomic Conduction Wernicke
## <dbl> <dbl> <dbl> <dbl>
## 1 0 0.661 0.526 0.304
## 2 1 0.859 0.650 0.431
## 3 2 0.880 0.698 0.433
## 4 3 0.893 0.728 0.471
## 5 4 0.914 0.749 0.507
tidyr
Tidy data: Every column is a variable, every row is an observation
head(NamingRecovery)
## SubjectID Diagnosis TestTime Correct Semantic.error Mixed.error
## 1 NR001 Conduction 0 0.8313 0.0688 0.0250
## 3 NR002 Conduction 0 0.7679 0.0357 0.0357
## 4 NR003 Conduction 0 0.0368 0.0441 0.0294
## 5 NR004 Wernicke 0 0.1357 0.0643 0.0429
## 6 NR005 Conduction 0 0.8526 0.0385 0.0256
## 7 NR006 Anomic 0 0.8158 0.0614 0.0614
## Formal.error Unrelated.error Nonword.error
## 1 0.0125 0.0063 0.0563
## 3 0.0238 0.0060 0.1310
## 4 0.3382 0.3529 0.1985
## 5 0.1643 0.2714 0.3214
## 6 0.0128 0.0064 0.0641
## 7 0.0351 0.0175 0.0088
Each row contains the data for participant i at test time j. Are these data “tidy”? Not quite: there are several response-type variables (Correct, Semantic.error, Mixed.error, etc.), and each of those is a separate observation. gather() reshapes those columns into key-value pairs:
NR.g <- gather(NamingRecovery,
key="ResponseType", value="Proportion",
"Correct", "Semantic.error", "Mixed.error", # instead could use 4:9
"Formal.error", "Unrelated.error", "Nonword.error")
head(NR.g)
## SubjectID Diagnosis TestTime ResponseType Proportion
## 1 NR001 Conduction 0 Correct 0.8313
## 2 NR002 Conduction 0 Correct 0.7679
## 3 NR003 Conduction 0 Correct 0.0368
## 4 NR004 Wernicke 0 Correct 0.1357
## 5 NR005 Conduction 0 Correct 0.8526
## 6 NR006 Anomic 0 Correct 0.8158
After gather(), the long-format data make it easy to plot each response type in its own facet:
ggplot(NR.g, aes(TestTime, Proportion, color=Diagnosis)) +
facet_wrap(~ ResponseType, scales = "free_y") +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.y=mean, geom="point")
The USArrests
data set contains violent crime arrests (per 100,000 residents) in each of the 50 states in the USA in 1973 and the percent of the population of each state that lived in urban areas.
summary(USArrests)
## Murder Assault UrbanPop Rape
## Min. : 0.80 Min. : 45 Min. :32.0 Min. : 7.3
## 1st Qu.: 4.08 1st Qu.:109 1st Qu.:54.5 1st Qu.:15.1
## Median : 7.25 Median :159 Median :66.0 Median :20.1
## Mean : 7.79 Mean :171 Mean :65.5 Mean :21.2
## 3rd Qu.:11.25 3rd Qu.:249 3rd Qu.:77.8 3rd Qu.:26.2
## Max. :17.40 Max. :337 Max. :91.0 Max. :46.0
A. Convert the USArrests
data set from a wide to a long format so that instead of separate variables for each crime type (Murder, Assault, Rape), there is one variable that identifies the crime type and one variable that contains the rates for each crime type for each state.
B. Make a scatterplot showing the relationship between each type of violent crime rate and percent of population living in urban areas.
# A. Convert the USArrests data set from a wide to a long format
x <- gather(USArrests, key="CrimeType", value="Rate",
Murder, Assault, Rape)
# B. Make a scatterplot showing each type of violent crime rate and percent urban population
ggplot(x, aes(UrbanPop, Rate)) +
facet_wrap(~ CrimeType, scales="free", nrow=1) +
geom_point() + stat_smooth(method="lm")
base::merge() and the dplyr *_join() functions do the same things. The *_join() functions are supposed to be faster and don’t mess with the order of the rows. (But I’m used to the syntax of merge(), so that’s what I use.)
By default, both merge() and *_join() will use columns with identical names as the “key” variables for aligning data elements. If the variables do not have identical names (e.g., Participant vs. ParticipantID), use the by argument to specify the matching variables.
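A minimal sketch (subject_info and trial_data are made-up data frames for illustration, not part of the workshop data):

# two hypothetical tables whose key columns have different names
subject_info <- data.frame(ParticipantID = c("S1", "S2"),
                           Group = c("Control", "Patient"),
                           stringsAsFactors = FALSE)
trial_data <- data.frame(Participant = c("S1", "S1", "S2"),
                         RT = c(512, 480, 630),
                         stringsAsFactors = FALSE)

# base R: name the key columns with by.x and by.y
merge(trial_data, subject_info, by.x = "Participant", by.y = "ParticipantID")

# dplyr: name the key columns with a named vector
left_join(trial_data, subject_info, by = c("Participant" = "ParticipantID"))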
base::subset() does the same job as dplyr::filter() plus dplyr::select(). Example: from NamingRecovery, get just the baseline accuracy and formal errors for participants with Conduction aphasia.
# base
NR.s <- subset(NamingRecovery, Diagnosis == "Conduction" & TestTime == 0,
select = c("SubjectID", "Diagnosis", "TestTime", "Correct", "Formal.error"))
NR.s
## SubjectID Diagnosis TestTime Correct Formal.error
## 1 NR001 Conduction 0 0.8313 0.0125
## 3 NR002 Conduction 0 0.7679 0.0238
## 4 NR003 Conduction 0 0.0368 0.3382
## 6 NR005 Conduction 0 0.8526 0.0128
## 10 NR009 Conduction 0 0.1386 0.1627
## 15 NR0013 Conduction 0 0.6023 0.1345
## 82 NR0019 Conduction 0 0.3050 0.1915
## 85 NR0022 Conduction 0 0.7627 0.0339
## 86 NR0023 Conduction 0 0.4348 0.1739
# dplyr
NR.s <- NamingRecovery %>%
filter(Diagnosis == "Conduction" & TestTime == 0) %>%
select(SubjectID, Diagnosis, TestTime, Correct, Formal.error)
NR.s
## SubjectID Diagnosis TestTime Correct Formal.error
## 1 NR001 Conduction 0 0.8313 0.0125
## 2 NR002 Conduction 0 0.7679 0.0238
## 3 NR003 Conduction 0 0.0368 0.3382
## 4 NR005 Conduction 0 0.8526 0.0128
## 5 NR009 Conduction 0 0.1386 0.1627
## 6 NR0013 Conduction 0 0.6023 0.1345
## 7 NR0019 Conduction 0 0.3050 0.1915
## 8 NR0022 Conduction 0 0.7627 0.0339
## 9 NR0023 Conduction 0 0.4348 0.1739
tidyr::separate()

separate() is useful when you have values that are actually multiple distinct pieces of information stuck together.
dat <- data.frame(Condition = paste(rep(c("A", "B"), 3), rep(1:3, each=2), sep="_"))
dat
## Condition
## 1 A_1
## 2 B_1
## 3 A_2
## 4 B_2
## 5 A_3
## 6 B_3
# equivalent pipe syntax: dat %>% separate(Condition, into=c("Factor1", "Factor2"), sep="_")
separate(dat, Condition, into=c("Factor1", "Factor2"), sep="_")
## Factor1 Factor2
## 1 A 1
## 2 B 1
## 3 A 2
## 4 B 2
## 5 A 3
## 6 B 3
dplyr::mutate()

mutate() is similar to summarize(), but returns a transformation of the data (same number of rows) rather than a summary (one value per group).
Example: Convert each participant’s accuracy to proportion change from their baseline accuracy.
ex <- NamingRecovery %>%
group_by(SubjectID) %>%
mutate(AccChange = Correct/Correct[TestTime == 0])
#summary(ex)
ggplot(ex, aes(TestTime, AccChange)) +
stat_summary(fun.y=mean, geom="line") +
stat_summary(fun.data=mean_se, geom="pointrange")
DRY: Don’t Repeat Yourself

Some additional general principles:

- Use the built-in help: ?function_name
- Give functions descriptive names, such as action_argument or target_action
- Use print statements for debugging and monitoring

The basic template is fun_name <- function(arg1, arg2, ..., arg_def=0){...}; use return to return the result:
the resultaddTwo <- function(x){
result <- x+2
return(result)
}
addTwo(10)
## [1] 12
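The arg_def=0 in the template above is a default argument value. A quick sketch (addN is a made-up name):

addN <- function(x, n = 2){
  # n has a default, so callers can omit it
  return(x + n)
}
addN(10)     # uses the default n = 2; returns 12
addN(10, 5)  # overrides the default; returns 15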
Fibonacci sequence: \(F_n = F_{n-1} + F_{n-2}\)
\(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...\)
Can be calculated using a for loop:
fib <- c(1,1)
for(n in 3:100){
fib[n] <- fib[n-1] + fib[n-2]
}
fib[1:15]
## [1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610
Write a function that takes one argument (x) and produces the first x numbers of the Fibonacci sequence: \(F_n = F_{n-1} + F_{n-2}\).
fib_fun <- function(x){
fib <- c(1,1)
for(i in 3:x){
fib[i] <- fib[i-1] + fib[i-2]
}
return(fib)
}
fib_fun(15)
## [1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610
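One caveat: as written, fib_fun() assumes x >= 3, because for smaller x the sequence 3:x counts down (3:1 is 3, 2, 1) and the loop fails. A defensive variant (fib_fun2 is a made-up name):

fib_fun2 <- function(x){
  fib <- c(1, 1)
  if (x > 2) {  # guard: skip the loop when 3:x would count down
    for(i in 3:x){
      fib[i] <- fib[i-1] + fib[i-2]
    }
  }
  return(fib[1:x])
}
fib_fun2(2)  # returns c(1, 1) instead of an error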
R for Data Science covers the tidyverse and R programming. Read it online for free: https://r4ds.had.co.nz/
Check out RStudio Cheat Sheets: https://www.rstudio.com/resources/cheatsheets/