This script examines the learning effects in the “SeeRel1” (or “SR1”) experiment. It also adds 11 new columns to the main data frame, as documented at the end of this preamble. (See earlier scripts “./SR1_exploratory1.Rmd” and “./SR1_CogSci2024.Rmd” for more information about the experimental design, working hypotheses, etc.)
We submitted a six-page report of the SR1 experiment to the Cognitive Science 2024 conference. (See “…papers/conferences/2024/CogSci2024/submit1/”). All four reviewers pointed out learning effects as the major methodological concern that undermines the interpretability of our results. Indeed, a major flaw of our design is that the video session always occurs on Day 1 and the linguistic session on Day 2. This makes it hard to interpret the lack of priming effects in the linguistic condition. A deflationary alternative interpretation is that it is a null result due to over-training.
To remedy this problem, this script estimates the effects of practice in an attempt to partial them out statistically and thereby improve the interpretability of the main analysis. The main idea is to conduct the substantive analyses not on the raw RT data but on the residuals of a non-linear regression of RT on training time. The present script calculates these residuals and stores them in the data frame under the name “DRT” (short for “detrended response time”). A follow-up script (“./SR1_exploratory2.Rmd”) will analyze these DRTs as a function of relational role assignment, etc.
The present script loads a dataframe called SR1_data generated by “./SR1_get_data2.Rmd”. Data from the 15 participants who passed the exclusion criteria are concatenated together. Each row in the dataframe represents one experimental trial. The variables (columns) are documented in “./SR1_get_data2.Rmd”.
The present script creates a new RData file that contains an augmented version of the same dataframe. The first 32 columns in the augmented dataframe are identical to the columns in the original data set that was loaded from “../data/SR1_main_data.RData”. For documentation of variables 1 through 32, see the preamble of “./SR1_get_data2.Rmd”.
Here we define the following 11 new columns:
c(“clippedRT”, “lcType”, “lcRSE”, “exp_tau”, “delta_AIC”, “RT_lcurve”, “residRT”, “rRT_mean”, “rRT_std”, “zrRT”, “filter3”)
This RData file is loaded by the scripts that perform the substantive analyses in May 2024. We recommend that most substantive analyses be performed on the subset of SR1_data for which filter3==TRUE. (Note: filter THREE, not filter TWO.) To access the “good” data, use the following code:
D <- subset(SR1_data, filter3)
Where to go from here: First, check out “./SR1_exploratory2.Rmd”. The publication-quality analyses and figures will be produced by “./SR1_CogSci2024_final.Rmd”.
To run this script on your computer, you may have to change the “trial_data_path” below. The file.choose() function will be your friend in this regard.
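(The code cell that performs the load is not shown in this rendering. Here is a minimal sketch of the load step, assuming the trial_data_path printed in the metadata below:)
## Hypothetical reconstruction of the load step:
trial_data_path <- "/Users/petrov.11/work/RelExper/SeeRel1/data/SR1_main_data.RData"
# trial_data_path <- file.choose()   # fallback if the hard-coded path fails
load(trial_data_path, verbose = TRUE)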
## Loading objects:
## SR1_data
## SR1_sbj
## meta_data
## descriptor1: Raw data of SeeRel1 experiment. Collected in May-June 2023 at OSU.
## descriptor2: Imported in R using SR1_get_data1.Rmd on Fri Jul 14 20:50:22 2023.
## SeeRel1_root: /Users/petrov.11/work/RelExper/SeeRel1
## data_path: data
## raw_data.csv: SeeRelation_allData_formatted.csv
## trial_data.RData: SR1_trial_data.RData
## analysis_path: analysis
## descriptor3: Dataframe SR1_trials loaded by SR1_get_data2.Rmd on Mon Jul 24 19:08:12 2023.
## trial_data_path: /Users/petrov.11/work/RelExper/SeeRel1/data/SR1_main_data.RData
## dropped_subj: c("16832", "18364", "24965", "29364", "29725", "38106")
## zRT_threshold: 3
## main_data.RData: SR1_main_data.RData
## descriptor4: Dataframe SR1_data[23040x32] loaded by SR1_learning_effect1.Rmd on Thu May 9 14:57:22 2024.
The data frame “SR1_data” contains data for the 15 participants who passed certain inclusion criteria. (See “./SR1_get_data2.Rmd”). Each participant performed two sessions of 768 trials each. (Zero entries correspond to participants who were excluded from the sample.)
## subjNum
## session 10889 16832 18364 19616 24965 29364 29725 30474 30723 38106 46991 52502
## video 768 0 0 768 0 0 0 768 768 0 768 768
## ling 768 0 0 768 0 0 0 768 768 0 768 768
## subjNum
## session 53927 55464 68030 69393 71440 79802 80116 87949 98309
## video 768 768 768 768 768 768 768 768 768
## ling 768 768 768 768 768 768 768 768 768
One trial-level exclusion criterion is that the response time should be neither too fast nor too slow: 200 <= RT < 2000 ms. (See SR1_data$filter1, calculated by “./SR1_get_data2.Rmd”.) It makes sense here to clip the raw RTs to this range. Only 60 raw RTs were > 2000 ms and only 48 raw RTs were < 200 ms.
## The clip limits are global parameters:
PAR$RT_limits <- c(200, 2000)
meta_data$RT_limits <- PAR$RT_limits # so that they will be saved in the .mat file
## Enforce the limits:
SR1_data$clippedRT <- pmax(PAR$RT_limits[1], pmin(PAR$RT_limits[2], SR1_data$RT))
As a very crude initial exploration, let’s just average the (clipped) RT data across the 15 individual participants. We’ll deal with individual differences later. We store the aggregated data in a data frame called “SR1_groupRT_data”:
## Average the RTs across the individual participants:
SR1_groupRT_data <- ddply(SR1_data, .(session, trialNum), function(D1) {
data.frame(
RT_mean = mean(D1$clippedRT),
N_sbj = nrow(D1)
)})
## Verify that there is an equal number of replications in each condition:
stopifnot(is.constant(SR1_groupRT_data$N_sbj))
## The number of participants in the sample is a global parameter:
PAR$N_sbj <- constant(SR1_groupRT_data$N_sbj)
## Scatter-plot the group-averaged RTs:
gp$RT_mean_scatter1 <- ggplot(SR1_groupRT_data,
aes(x = trialNum, y = RT_mean, group=session, color=session, shape=session)) +
geom_point()
print(gp$RT_mean_scatter1)
Plot the two sessions in chronological order rather than one underneath the other:
## The number of trials per session is a global parameter:
PAR$N_trials_per_session <- max(SR1_data$trialNum) # 768
## Helper function:
convert_trialNum_to_trialNum2 <- function(trialNum, session)
(trialNum + ifelse(session=="video", 0, PAR$N_trials_per_session))
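## Illustrative check (not in the original script): the ling session is shifted
#  by one full session, so ling trial 1 maps onto trialNum2 == 769:
stopifnot(convert_trialNum_to_trialNum2(1, "ling") == PAR$N_trials_per_session + 1)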
## Define a new time variable:
SR1_groupRT_data$trialNum2 <- (with (SR1_groupRT_data,
convert_trialNum_to_trialNum2(trialNum, session)))
## Scatter-plot 2: same data, new X axis
gp$RT_mean_scatter2 <- ggplot(SR1_groupRT_data,
aes(x = trialNum2, y = RT_mean, group=session, color=session, shape=session)) +
geom_point()
print(gp$RT_mean_scatter2)
Let’s smooth the data by averaging across mini-blocks of 8 consecutive trials:
## Global parameter:
#PAR$miniblock_size = 16
PAR$miniblock_size = 8
## Helper function:
convert_trialNum_to_trialMblk <- function(trialNum, miniblock_size = PAR$miniblock_size)
(floor((trialNum-1) / miniblock_size) * miniblock_size + floor(miniblock_size/2))
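## Illustrative check (not in the original script): with miniblock_size = 8,
#  trials 1-8 collapse onto midpoint 4 and trials 9-16 onto midpoint 12:
stopifnot(all(convert_trialNum_to_trialMblk(1:8, 8) == 4))
stopifnot(all(convert_trialNum_to_trialMblk(9:16, 8) == 12))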
## Define a new variable trialMblk that collapses each miniblock into a single point:
SR1_groupRT_data$trialMblk <- convert_trialNum_to_trialMblk(SR1_groupRT_data$trialNum)
SR1_groupRT_data$trialMblk2 <- (with (SR1_groupRT_data,
convert_trialNum_to_trialNum2(trialMblk, session)))
## Average the RTs within miniblocks:
SR1_groupRT_Mblk <- ddply(SR1_groupRT_data, .(session, trialMblk), function(D1) {
data.frame(
trialMblk2 = constant(D1$trialMblk2),
RT_mean = mean(D1$RT_mean),
N_obs = nrow(D1) * constant(D1$N_sbj, verifyp=FALSE) # number of observations
)})
## Sanity check: The number of observations in each miniblock should be
# 120 = 15 subjects * 8 trials/miniblock = const
stopifnot(constant(SR1_groupRT_Mblk$N_obs) == PAR$N_sbj * PAR$miniblock_size)
## Scatter-plot the group-averaged RTs:
gp$RT_mean_scatter3 <- ggplot(SR1_groupRT_Mblk,
aes(x = trialMblk2, y = RT_mean, group=session, color=session, shape=session)) +
geom_point()
print(gp$RT_mean_scatter3)
nls Regression of the Group Data from the Video Session

We will model the learning curves as exponential functions (Heathcote, Brown, & Mewhort, 2000, Psychonomic Bulletin & Review, 7, 185-207). We parameterize these exponential functions following Equation 4 of Petrov & Hayes (2010, Journal of Vision, 10(14):11). Each learning curve is defined by three parameters: y192, y576, and tau. The first two parameters fix the mean RTs at two reference (or “pivot”) trials: 192 and 576 trials after the beginning of the session. These reference trials were chosen a priori as a function of the session length. The estimated learning curves (as opposed to their parameters) should be relatively insensitive to the choice of pivot times. See “./expTransf/expTransf_example.Rmd” for a related example.
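In math form (a direct transcription of the nls formula defined below, with pivot times $t_{192} = 192$ and $t_{576} = 576$):

$$\widehat{RT}(t) = \frac{y_{192}\,(e^{-t/\tau} - e^{-t_{576}/\tau}) + y_{576}\,(e^{-t_{192}/\tau} - e^{-t/\tau})}{e^{-t_{192}/\tau} - e^{-t_{576}/\tau}},$$

which satisfies RT(t192) = y192 and RT(t576) = y576 by construction, and decays exponentially toward a constant asymptote with time constant tau.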
## Fix the reference trials a priori:
PAR$ref_trials <- c(1/4, 3/4) * PAR$N_trials_per_session # c(192, 576)
## Helper function:
find_RT_at_reference_trials <- function(t, RT, ref_trials = PAR$ref_trials) {
stopifnot(length(t) == length(RT)) # must be vectors of equal length
# Find the RT at the first reference trial:
#y192 <- RT[t == PAR$ref_trials[1]]
y192 <- RT[which.min(abs(t - PAR$ref_trials[1]))]
# Find the RT at the second reference trial:
#y576 <- RT[t == PAR$ref_trials[2]]
y576 <- RT[which.min(abs(t - PAR$ref_trials[2]))]
# Pack as a list and return:
list(y192 = y192, y576 = y576)
}
The nonlinear, decelerating shape of the learning curves is governed by the time-constant parameter tau. It is measured in units of time, that is, trials. Thus the ratio t/tau is dimensionless and can be passed as an argument to exp(). We specify the exact exponential equation by means of an object of class formula, which can be passed to nls() to estimate the parameters (Venables & Ripley, 2010, Modern Applied Statistics with S).
Note that the reference times are hardwired in the formula. Make sure to redefine the formula if you change PAR$ref_trials.
## Make the formula object. This one formula will be used for all model fits:
M$nls_formula <- RT ~ (y192 * ( exp(-t/tau) - exp(-576/tau)) +
y576 * (exp(-192/tau) - exp(-t/tau))) /
(exp(-192/tau) - exp(-576/tau))
## Sanity check. Because the reference times are hardwired in the formula,
# assert that they match the corresponding values in PAR$ref_trials:
stopifnot(PAR$ref_trials[1] == 192)
stopifnot(PAR$ref_trials[2] == 576)
## Initial guess for the time-constant parameter:
PAR$guess_tau = 150 # trials
The following function implements the exponential model directly:
## Evaluate the exponential learning curve at times t for given parameters y192, y576, and tau:
exp_model <- function(t, y192, y576, tau,
t192 = PAR$ref_trials[1], t576 = PAR$ref_trials[2]) {
# Calculate subexpressions:
exp_t_tau <- exp( -t/tau)
exp_t192_tau <- exp(-t192/tau)
exp_t576_tau <- exp(-t576/tau)
# Combine the subexpressions into the return value:
((y192 * (exp_t_tau - exp_t576_tau) +
y576 * (exp_t192_tau - exp_t_tau) ) /
(exp_t192_tau - exp_t576_tau))
}
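As a quick sanity check (an illustrative assertion, not part of the original script), the curve passes exactly through the two pivots regardless of tau:
stopifnot(abs(exp_model(192, y192 = 500, y576 = 450, tau = 150) - 500) < 1e-9)
stopifnot(abs(exp_model(576, y192 = 500, y576 = 450, tau = 150) - 450) < 1e-9)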
Let’s fit this exponential learning model to the group data for the video session.
The code below uses the following template: For each analysis, the relevant input data are formatted as a data frame with 768 rows and stored in the variable D. The initial guess for the parameter vector is stored in the variable theta0. Then stats::nls() is called to perform the model fit using nonlinear least squares. The resulting nls model object is stored in M under a unique name.
## Extract the data and store it in D:
D <- with(subset(SR1_groupRT_data, session == "video"),
data.frame(session = session, t = trialNum, RT = RT_mean))
stopifnot(nrow(D) == PAR$N_trials_per_session)
## Construct an initial guess for the three parameters:
theta0 <- find_RT_at_reference_trials(D$t, D$RT)
theta0$tau <- PAR$guess_tau
## Call nls() to do the real work:
currM <- nls(M$nls_formula, D, theta0)
M$nls_group_video <- currM
## Print a summary of the results and store it as an object that can be unpacked:
(currS <- summary(currM))
##
## Formula: RT ~ (y192 * (exp(-t/tau) - exp(-576/tau)) + y576 * (exp(-192/tau) -
## exp(-t/tau)))/(exp(-192/tau) - exp(-576/tau))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## y192 523.579 2.844 184.1 <2e-16 ***
## y576 479.549 2.175 220.5 <2e-16 ***
## tau 123.594 8.523 14.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.66 on 765 degrees of freedom
##
## Number of iterations to convergence: 8
## Achieved convergence tolerance: 8.263e-06
Last but not least, the parameter estimates (and their corresponding standard errors) are extracted using this helper function:
## A function that takes an nls model object and returns a one-row data frame:
extract_param_estimates <- function (currS) {
data.frame(y192 = coef(currS)[["y192", "Estimate"]],
y192_se = coef(currS)[["y192", "Std. Error"]],
y576 = coef(currS)[["y576", "Estimate"]],
y576_se = coef(currS)[["y576", "Std. Error"]],
tau = coef(currS)[["tau", "Estimate"]],
tau_se = coef(currS)[["tau", "Std. Error"]],
RSE = currS$sigma,
status = currS$convInfo[["stopMessage"]])
} # extract_param_estimates
## Extract parameter estimates and store them as a row in param_estimates:
group_params <- cbind(data.frame(model = "nls_group_video",
session = constant(D$session)),
extract_param_estimates(currS))
print(group_params, digits = 3)
## model session y192 y192_se y576 y576_se tau tau_se RSE status
## 1 nls_group_video video 524 2.84 480 2.18 124 8.52 43.7 converged
Add the fitted values and residuals to the data frame:
## Estimated learning curve:
D$RT_hat <- fitted(currM)
## Residuals (observed minus fitted):
D$RT_resid <- residuals(currM)
## The residuals should have mean == 0 and std.dev close to RSE (up to a d.f. correction):
summary(D$RT_resid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -142.859 -28.111 -7.387 0.000 20.034 249.553
cat(sprintf("sd(RT_resid) = %.2f, RSE = %.2f.", sd(D$RT_resid), sigma(currM)))
## sd(RT_resid) = 43.60, RSE = 43.66.
Sanity check: Verify that the function exp_model()
produces the same estimated learning curve (within rounding error):
## Calculate the fitted values directly:
D$RT_hat1 <- exp_model(D$t, group_params$y192, group_params$y576, group_params$tau)
## Sanity check:
stopifnot(sum((D$RT_hat - D$RT_hat1)^2) < 1e-10) # allowance for rounding errors
Superimpose the fitted learning curve onto the raw scatterplot:
## Scatter-plot with a superimposed learning curve:
gp$RT_group_lcurve1 <- ggplot(D,
aes(x = t, y = RT, shape=session)) +
geom_point(color="#AAAAAA") +
geom_line(aes(y = RT_hat, color=session), linewidth=1)
print(gp$RT_group_lcurve1)
## Plot the residuals:
gp$RT_group_residuals1 <- ggplot(D,
aes(x = t, y = RT_resid, shape=session, color=session)) +
geom_point()
print(gp$RT_group_residuals1)
nls Regression of the Group Data from the Linguistic Session

Repeat the exercise for the group data from the linguistic session. It turns out that nls() does not converge on these data. This is because the data are scattered essentially around a straight line, which cannot be fit well by an exponential function starting at trial 1.
## Extract the linguistic data and store it in D:
D <- with(subset(SR1_groupRT_data, session == "ling"),
data.frame(session = session, t = trialNum, RT = RT_mean))
stopifnot(nrow(D) == PAR$N_trials_per_session)
## Construct an initial guess for the three parameters:
theta0 <- find_RT_at_reference_trials(D$t, D$RT)
theta0$tau <- PAR$guess_tau
## Call nls() to do the real work.
# Because it generates an error under the default settings, we must change
# the control setting to issue a warning instead of a fatal error:
currM <- nls(M$nls_formula, D, theta0, control= list(warnOnly = TRUE))
## Warning in nls(M$nls_formula, D, theta0, control = list(warnOnly = TRUE)):
## number of iterations exceeded maximum of 50
M$nls_group_ling <- currM
## Print a summary of the results and store it as an object that can be unpacked:
(currS <- summary(currM))
##
## Formula: RT ~ (y192 * (exp(-t/tau) - exp(-576/tau)) + y576 * (exp(-192/tau) -
## exp(-t/tau)))/(exp(-192/tau) - exp(-576/tau))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## y192 398.808 1.248 319.452 <2e-16 ***
## y576 398.808 1.248 319.431 <2e-16 ***
## tau 14.075 5.475 2.571 0.0103 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.3 on 765 degrees of freedom
##
## Number of iterations till stop: 50
## Achieved convergence tolerance: 0.1113
## Reason stopped: number of iterations exceeded maximum of 50
## Extract parameter estimates and add them as a new row in param_estimates:
temp <- cbind(data.frame(model = "nls_group_ling",
session = constant(D$session)),
extract_param_estimates(currS))
group_params <- rbind(group_params, temp) # add temp as a new row at the bottom
print(group_params, digits = 3)
## model session y192 y192_se y576 y576_se tau tau_se RSE
## 1 nls_group_video video 524 2.84 480 2.18 123.6 8.52 43.7
## 2 nls_group_ling ling 399 1.25 399 1.25 14.1 5.48 33.3
## status
## 1 converged
## 2 number of iterations exceeded maximum of 50
It is probably a better idea to fit a simpler, linear model to the data from the linguistic session. Before we commit to this, however, let’s look at the raw data of the individual participants.
## Remove temporary variables from the workspace:
rm("temp", "D", "theta0", "currM", "currS")
Let’s make RT scatterplots separately for each individual participant. To make the plots easier to “fit by eye”, it is useful to smooth the data by averaging across miniblocks. We define a new parameter PAR$miniblock_size1 -- it is the individual-level counterpart of the group-level PAR$miniblock_size.
## Miniblocks at the individual level:
PAR$miniblock_size1 <- 16 # do not confuse with PAR$miniblock_size
## Define a new variable trialMblk that collapses each miniblock into a single point:
# Note: This time we modify the main dataframe SR1_data instead of SR1_groupRT_data.
SR1_data$trialMblk <-
convert_trialNum_to_trialMblk(SR1_data$trialNum, PAR$miniblock_size1)
## Average the RTs within miniblocks:
SR1_indivRT_Mblk <- ddply(SR1_data, .(subjNum, session, trialMblk), function(D1) {
data.frame(
RT_mean = mean(D1$clippedRT),
N_obs = nrow(D1) # number of observations
)})
## Sanity check:
stopifnot(constant(SR1_indivRT_Mblk$N_obs) == PAR$miniblock_size1)
## Alternative time variable -- useful for plotting sessions side-by-side:
SR1_indivRT_Mblk$trialMblk2 <- (with (SR1_indivRT_Mblk,
convert_trialNum_to_trialNum2(trialMblk, session)))
## Scatter-plot the miniblock-averaged RTs separately for each individual:
gp$RT_indiv_scatter1 <- ggplot(SR1_indivRT_Mblk,
aes(x = trialMblk2, y = RT_mean, group=session, color=session, shape=session)) +
geom_point() + facet_wrap(vars(subjNum))
print(gp$RT_indiv_scatter1)
Our ultimate goal is to remove the learning effects from the RT data by regressing away the learning curve separately for each participant in each session. It seems clear from eyeballing the individual scatterplots that we should consider two alternative models – exponential and linear.
Let’s adopt the following strategy: We are going to fit both models to the clipped RT data of each individual participant in each session. If nls() fails to converge on a given data set with the default control parameters, we decide that the exponential model is not a viable model for these data. If nls() does converge, then we do a model comparison to decide whether the exponential model (which has one extra parameter) or the linear model should be preferred.
As will soon become clear, we need to smooth the raw data to facilitate convergence of the nls() algorithm. The smoothing is performed by averaging within miniblocks. The following helper function accomplishes this:
## Helper function that produces a data frame D with SMOOTHED data for
# a given participant in a given session:
extract_and_smooth_indiv_RTs <- function(which_subjNum, which_session, miniblock_size,
big_data = SR1_data) {
# Extract the relevant subset of the big data frame. It has 768 rows:
D768 <- subset(big_data, subjNum == which_subjNum & session == which_session)
stopifnot(nrow(D768) == PAR$N_trials_per_session) # 768
# Re-calculate trialMblk with the specified miniblock_size:
if (miniblock_size > 1) {
D768$trialMblk <- convert_trialNum_to_trialMblk(D768$trialNum, miniblock_size)
} else {
stopifnot(miniblock_size == 1)
# convert_trialNum_to_trialMblk doesn't handle this special case very well
D768$trialMblk <- D768$trialNum
}
# Average the clipped RTs within each miniblock:
D <- ddply(D768, .(trialMblk), function(D1) {
data.frame(
subjNum = constant(D1$subjNum, verifyp = FALSE),
session = constant(D1$session, verifyp = FALSE),
t = constant(D1$trialMblk, verifyp = FALSE),
RT = mean(D1$clippedRT),
N_obs = nrow(D1)
)}) # end of ddply(...)
# Sanity checks:
stopifnot(nrow(D) == PAR$N_trials_per_session / miniblock_size)
stopifnot(constant(D$N_obs) == miniblock_size)
# Return the resulting data frame:
D
} # extract_and_smooth_indiv_RTs
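Typical usage, as a hypothetical illustration (subjNum “10889” is one of the retained participants):
D <- extract_and_smooth_indiv_RTs("10889", "video", miniblock_size = 4)
stopifnot(nrow(D) == 192)   # 768 trials / 4 trials per miniblock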
We store the residual standard error (RSE) and other results from these calculations in a new data frame called model_fits.

Fix a miniblock_size parameter and fit 30 linear models + 30 exponential models. There is a hyper-parameter PAR$miniblock_size_nls (aka mb_sz). We ran the script below multiple times for mb_sz in c(1, 2, 4, 6, 8). The results are summarized in the pre-formatted block of text below. The overall conclusion of these trial-and-error explorations is that the best value is PAR$miniblock_size_nls == 4. Accordingly, it is set as the default in the “production” code below.

I’ve added a few “magic” special cases that customize mb_sz for certain individual data sets. With these customizations, nls() converges for 11 of the 15 data sets in the video session and for 7 of the 15 data sets in the linguistic session.
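Note: the cell that initializes model_fits is not shown in this rendering. Here is a minimal sketch of what it must contain, inferred from the loop below; all column names are taken from the loop code, and the row order (sessions interleaved within subject) matches the printed results:
## Hypothetical reconstruction of the model_fits skeleton:
retained_subj <- sort(unique(SR1_data$subjNum))   # the 15 retained participants
model_fits <- expand.grid(session = c("video", "ling"), subjNum = retained_subj)
model_fits <- model_fits[, c("subjNum", "session")]   # 30 rows = 15 subjects x 2 sessions
model_fits$mb_sz <- NA   # customized smoothing parameter, filled in below
model_fits[, c("lin_RSE", "lin_intcpt", "lin_intcpt_se",
               "lin_slope", "lin_slope_se")] <- NA   # linear-model results
model_fits$nls_isConv <- FALSE   # pessimistic default, overwritten on convergence
model_fits[, c("exp_RSE", "exp_y192", "exp_y192_se", "exp_y576",
               "exp_y576_se", "exp_tau", "exp_tau_se")] <- NA   # exponential-model results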
## Pick a value for PAR$miniblock_size_nls:
PAR$miniblock_size_nls <- 4
## For each <subj, session> pair, fit both models:
for (k in seq(1, nrow(model_fits))) {
## Determine customized smoothing parameter:
curr_subjNum <- as.character(model_fits$subjNum[[k]]) # e.g., "10889"
curr_session <- as.character(model_fits$session[[k]]) # "video" or "ling"
curr_miniblock_size <- PAR$miniblock_size_nls # UNLESS ...
# ... UNLESS we are in one of the following "magic" cases.
# The customized mb_sz values were determined by trial and error:
if (curr_subjNum == "69393") { curr_miniblock_size <- 1 } # for both sessions
if (curr_subjNum == "80116" & curr_session == "video") {
curr_miniblock_size <- 2 }
if (curr_subjNum == "30474" & curr_session == "ling") {
curr_miniblock_size <- 1 }
if (curr_subjNum == "10889" & curr_session == "ling") {
curr_miniblock_size <- 6 }
# Store the customized smoothing parameter in the data frame:
model_fits$mb_sz[[k]] <- curr_miniblock_size
## Extract the relevant data
D <- extract_and_smooth_indiv_RTs(curr_subjNum, curr_session, curr_miniblock_size)
## Fit the linear model. It always converges
linM <- lm(RT ~ t, D) # an object of class "lm"
linS <- summary(linM) # an object of class "summary.lm"
## Store information about the linear fit in the data frame:
model_fits$lin_RSE[[k]] <- sigma(linM)
model_fits$lin_intcpt[[k]] <- coef(linS)[["(Intercept)", "Estimate"]]
model_fits$lin_intcpt_se[[k]] <- coef(linS)[["(Intercept)", "Std. Error"]]
model_fits$lin_slope[[k]] <- coef(linS)[["t", "Estimate"]]
model_fits$lin_slope_se[[k]] <- coef(linS)[["t", "Std. Error"]]
## Fit the exponential model. It may NOT converge!
# Construct an initial guess for the three parameters:
theta0 <- find_RT_at_reference_trials(D$t, D$RT)
theta0$tau <- PAR$guess_tau
# Call nls() and check for convergence. expM is an object of class "nls"
expM <- nls(M$nls_formula, D, theta0, control= list(warnOnly = TRUE))
if (expM$convInfo$isConv) {
# The exponential model converged -- store the parameter estimates:
model_fits$nls_isConv[[k]] <- TRUE
model_fits$exp_RSE[[k]] <- sigma(expM)
expS <- summary(expM) # an object of class "summary.nls"
model_fits$exp_y192[[k]] <- coef(expS)[["y192", "Estimate"]]
model_fits$exp_y192_se[[k]] <- coef(expS)[["y192", "Std. Error"]]
model_fits$exp_y576[[k]] <- coef(expS)[["y576", "Estimate"]]
model_fits$exp_y576_se[[k]] <- coef(expS)[["y576", "Std. Error"]]
model_fits$exp_tau[[k]] <- coef(expS)[["tau", "Estimate"]]
model_fits$exp_tau_se[[k]] <- coef(expS)[["tau", "Std. Error"]]
} else {
# The exponential model did not converge -- do nothing:
# The data frame is initialized with correct values (mostly NAs).
expS <- NULL # erase any leftovers from the previous iteration, just in case
} # end if (expM$convInfo$isConv)
} # for k
## Results of fitting the models with PAR$miniblock_size_nls == 4:
##
## In the video session, nls() converged for these 11 participants:
## subjNum session mb_sz lin_RSE exp_RSE exp_tau
## 5 30474 video 4 82.94 77.96 111.11
## 7 30723 video 4 85.13 80.63 85.00
## 9 46991 video 4 96.78 94.67 210.19
## 11 52502 video 4 94.80 91.19 437.78
## 13 53927 video 4 65.10 58.05 24.48
## 15 55464 video 4 107.37 89.06 73.39
## 17 68030 video 4 78.64 78.42 300.24
## 19 69393 video 1 195.87 192.87 74.30
## 21 71440 video 4 98.72 76.06 31.69
## 25 80116 video 2 131.51 125.91 119.83
## 27 87949 video 4 90.71 76.14 190.94
##
## In the ling session, nls() converged for these 7 participants:
## subjNum session mb_sz lin_RSE exp_RSE exp_tau
## 2 10889 ling 6 26.00 25.99 -42.88
## 6 30474 ling 1 86.99 86.70 60.86
## 18 68030 ling 4 41.37 38.83 218.74
## 20 69393 ling 1 204.69 204.81 907.50
## 22 71440 ling 4 49.05 49.13 349.48
## 26 80116 ling 4 107.86 107.88 508.99
## 28 87949 ling 4 56.34 54.60 128.90
## Remove temporary variables from the workspace:
rm("k", "curr_subjNum", "curr_session", "curr_miniblock_size",
"D", "theta0", "linM", "linS", "expM", "expS")
Experiments with various values of PAR$miniblock_size_nls produced the following results. These trial-and-error runs were used to determine the “magic” customized values for the miniblock_size smoothing parameter in the preceding code cell.
##################################################################
Results of fitting the models with PAR$miniblock_size_nls == 1:

In the video session, nls() converged for these 8 participants:
     subjNum session lin_RSE exp_RSE exp_tau
5      30474   video     135     132   111.8
9      46991   video     184     183   209.5
11     52502   video     153     151   439.5
15     55464   video     171     160    72.1
17     68030   video     159     159   295.9
19     69393   video     196     193    74.3   # <-- unstable
25     80116   video     172     167   119.4
27     87949   video     150     142   190.4

In the ling session, nls() converged for these 5 participants:
     subjNum session lin_RSE exp_RSE exp_tau
6      30474    ling    87.0    86.7    60.9   # <-- unstable
18     68030    ling    76.9    75.6   219.5
20     69393    ling   204.7   204.8   907.5   # <-- unstable
22     71440    ling    94.3    94.4   349.6
28     87949    ling    97.0    95.9   129.6

##################################################################
Results of fitting the models with PAR$miniblock_size_nls == 2:

In the video session, nls() converged for these 8 participants:
     subjNum session lin_RSE exp_RSE exp_tau
5      30474   video   106.2   102.4  112.03
7      30723   video   125.4   122.4   83.09   # <-- new
9      46991   video   137.0   135.5  209.68
11     52502   video   116.6   113.6  439.81
15     55464   video   135.2   121.1   71.62
17     68030   video   109.7   109.5  297.61
25     80116   video   131.5   125.9  119.83
27     87949   video   116.0   105.0  190.89

In the ling session, nls() converged for these 5 participants:
     subjNum session lin_RSE exp_RSE exp_tau
12     52502    ling  126.68  124.48  -22.89   # <-- new
18     68030    ling   55.45   53.60  219.28
22     71440    ling   70.16   70.21  350.28
26     80116    ling  126.25  126.19  509.81   # <-- new
28     87949    ling   71.18   69.80  130.18

##################################################################
Results of fitting the models with PAR$miniblock_size_nls == 4:

In the video session, nls() converged for these 10 participants:
     subjNum session lin_RSE exp_RSE exp_tau
5      30474   video   82.94   77.96  111.11
7      30723   video   85.13   80.63   85.00
9      46991   video   96.78   94.67  210.19
11     52502   video   94.80   91.19  437.78
13     53927   video   65.10   58.05   24.48   # <-- new
15     55464   video  107.37   89.06   73.39
17     68030   video   78.64   78.42  300.24
21     71440   video   98.72   76.06   31.69   # <-- new
25     80116   video  113.93  125.91  119.83   # <-- unstable
27     87949   video   90.71   76.14  190.94

In the ling session, nls() converged for these 5 participants:
     subjNum session lin_RSE exp_RSE exp_tau
12     52502    ling   97.21  124.48  -22.89
18     68030    ling   41.37   38.83  218.74
22     71440    ling   49.05   49.13  349.48
26     80116    ling  107.86  107.88  508.99
28     87949    ling   56.34   54.60  128.90

##################################################################
Results of fitting the models with PAR$miniblock_size_nls == 6:

In the video session, nls() converged for these 8 participants:
     subjNum session mb_sz lin_RSE exp_RSE exp_tau
7      30723   video     6   75.62   70.40   82.76
9      46991   video     6   82.28   79.77  209.26
11     52502   video     6   83.99   79.97  440.28
15     55464   video     6   98.68   78.38   74.19
17     68030   video     6   62.17   61.86  297.47
19     69393   video     6   98.97   92.99   75.47
25     80116   video     6   95.17   87.31  119.59
27     87949   video     6   78.22   60.73  191.63

In the ling session, nls() converged for these 5 participants:
     subjNum session mb_sz lin_RSE exp_RSE exp_tau
2      10889    ling     6   26.00   25.99  -42.88   # <-- new
18     68030    ling     6   37.23   34.46  221.66
22     71440    ling     6   39.12   39.21  344.23
26     80116    ling     6   92.00   92.05  511.73
28     87949    ling     6   46.22   44.03  126.67

##################################################################
Running the same script with PAR$miniblock_size_nls == 8 produced
a fatal error in one of the nls() attempts.
##################################################################
******* Final Results of the Trial-and-Error Runs *******
Running the same script with CUSTOMIZED miniblock_size produced
the following results, 2024-05-08:

In the video session, nls() converged for these 11 participants:
     subjNum session mb_sz lin_RSE exp_RSE exp_tau
5      30474   video     4   82.94   77.96  111.11
7      30723   video     4   85.13   80.63   85.00
9      46991   video     4   96.78   94.67  210.19
11     52502   video     4   94.80   91.19  437.78
13     53927   video     4   65.10   58.05   24.48
15     55464   video     4  107.37   89.06   73.39
17     68030   video     4   78.64   78.42  300.24   # will be rejected by AIC
19     69393   video     1  195.87  192.87   74.30
21     71440   video     4   98.72   76.06   31.69
25     80116   video     2  131.51  125.91  119.83
27     87949   video     4   90.71   76.14  190.94

In the ling session, nls() converged for these 7 participants:
     subjNum session mb_sz lin_RSE exp_RSE exp_tau
2      10889    ling     6   26.00   25.99  -42.88   # negative tau makes no sense
6      30474    ling     1   86.99   86.70   60.86
18     68030    ling     4   41.37   38.83  218.74
20     69393    ling     1  204.69  204.81  907.50   # exp_RSE > lin_RSE
22     71440    ling     4   49.05   49.13  349.48   # exp_RSE > lin_RSE
26     80116    ling     4  107.86  107.88  508.99   # exp_RSE > lin_RSE
28     87949    ling     4   56.34   54.60  128.90
##################################################################
We are going to use the Akaike information criterion (AIC) to select between the two competing models in each case. Theoretically, the exponential model subsumes the linear model as a special case when the time constant tau grows very large. Thus, the exponential model should always achieve the better fit. This is not true in practice because of the vagaries of the iterative optimization algorithm within nls(). But if we assume that nls() did minimize the residual error (RSE), then it should always be the case that exp_RSE <= lin_RSE. The question is whether this drop can be attributed to overfitting due to the extra free parameter exp_tau.
The AIC can be calculated from a least-squares solution by the formula AIC = 2*k + n*log(RSE^2) = 2*k + 2*n*log(RSE), where k is the number of free parameters and n is the number of data points used to fit the model.
Reference: Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference: A practical information-theoretic approach (2nd ed.), p. 63, cited in https://en.wikipedia.org/wiki/Akaike_information_criterion.
The model with the minimum AIC is preferred. The following helper function calculates the difference AIC(lin_model) - AIC(exp_model); thus, positive values favor the exponential model:
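Plugging in k = 2 for the linear model and k = 3 for the exponential one, this difference reduces to

$$\Delta\mathrm{AIC} = \mathrm{AIC}_{\mathrm{lin}} - \mathrm{AIC}_{\mathrm{exp}} = 2n\,\ln\!\left(\frac{\mathrm{RSE}_{\mathrm{lin}}}{\mathrm{RSE}_{\mathrm{exp}}}\right) - 2,$$

so the exponential model is preferred whenever RSE_lin/RSE_exp > exp(1/n).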
## Akaike information criterion:
delta_AIC <- function(lin_RSE, exp_RSE, N_obs) {
# The linear model has 2 parameters:
AIC_lin <- 2*2 + 2*N_obs*log(lin_RSE)
# The exponential model has 3 parameters:
AIC_exp <- 2*3 + 2*N_obs*log(exp_RSE)
# Return the difference. Positive values favor the exponential model
AIC_lin - AIC_exp
}
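For example, with made-up RSE values and n = 192 points (as with mb_sz == 4):
delta_AIC(lin_RSE = 80, exp_RSE = 76, N_obs = 192)   # ~ +17.7 > 0, favors the exponential model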
We are set to perform the actual model comparisons:
## Add a new column -- lcType, short for "learning-curve model type":
model_fits$lcType <- NA
## The linear model wins by default whenever nls() failed to converge:
idx <- !model_fits$nls_isConv # a vector of logical values
model_fits$lcType[idx] <- "lin"
## The linear model also wins when the time-constant estimate is negative:
idx <- (model_fits$nls_isConv & (model_fits$exp_tau <= 0))
model_fits$lcType[idx] <- "lin"
## The linear model also wins whenever lin_RSE <= exp_RSE:
# (This happens when nls() failed to find the true optimal solution for
# the non-linear model.)
idx <- (model_fits$nls_isConv & (model_fits$lin_RSE <= model_fits$exp_RSE))
model_fits$lcType[idx] <- "lin"
## At this point, only three data sets are left standing in the linguistic
## condition -- indices 6, 18, and 28 -- subjNum 30474, 68030, and 87949.
## In the video condition, all 11 cases for which nls() converged are still
## standing. We appeal to AIC to resolve these non-trivial cases:
## Add yet another column to model_fits:
model_fits$delta_AIC <- NA
idx <- model_fits$nls_isConv
model_fits$delta_AIC[idx] <- delta_AIC(model_fits$lin_RSE[idx],
model_fits$exp_RSE[idx],
PAR$N_trials_per_session/model_fits$mb_sz[idx])
## Positive values of delta_AIC favor the exponential model:
idx <- is.na(model_fits$lcType) # the cases that are still not resolved
model_fits$lcType[idx] <- ifelse(model_fits$delta_AIC[idx] > 0,
"exp", # AIC favors the exponential model
"lin") # AIC favors the linear model
## The AIC criterion favored "lin" in only one of the contested cases --
## that for participant 68030 in the video condition (idx==17).
## After all is said and done, 10 of the 15 cases in the video condition have
## lcType=="exp", as contrasted to only 3 cases in the linguistic condition.
## This is reflected in the following table:
with(model_fits, table(session, lcType))
## lcType
## session exp lin
## video 10 5
## ling 3 12
The following function is a wrapper around exp_model() defined earlier. It takes a single argument, an integer row-index into model_fits, and returns a function of a single argument, t, that can be applied to data.
## Take an integer index into model_fits and return a function of t:
indiv_exp_model <- function(k, params = model_fits) {
## Retrieve the relevant parameters:
stopifnot(params$nls_isConv[[k]]) # sanity check
y192 <- params$exp_y192[[k]]
y576 <- params$exp_y576[[k]]
tau <- params$exp_tau[[k]]
## Construct and return a function of a single argument:
function(t)
exp_model(t, y192, y576, tau)
} # indiv_exp_model
## Test case illustrating typical usage:
k <- 5
D <- subset(SR1_data, subjNum == model_fits$subjNum[[k]] & session == model_fits$session[[k]])
indivF <- indiv_exp_model(k)
D$RT_lcurve <- indivF(D$trialNum)
## Plot the fit:
print(ggplot(D, aes(x = trialNum, y = clippedRT)) + geom_point(color="#AAAAAA") +
geom_line(aes(y = RT_lcurve), color="red"))
Finally we are in a position to estimate the learning curve for each individual data set.
## We'll add a few columns to the big data frame, to be populated in the loop below:
SR1_data$lcType <- NA
SR1_data$lcRSE <- NA
SR1_data$exp_tau <- NA
SR1_data$delta_AIC <- NA
SR1_data$RT_lcurve <- NA # Sic!
SR1_data$residRT <- NA # Sic!
## For each <subj, session> pair, calculate RT_lcurve from the appropriate model:
for (k in seq(1, nrow(model_fits))) {
## What are we dealing with?
curr_subjNum <- as.character(model_fits$subjNum[[k]]) # e.g., "10889"
curr_session <- as.character(model_fits$session[[k]]) # "video" or "ling"
curr_lcType <- model_fits$lcType[[k]] # "lin" or "exp"
## A logical index to the relevant rows in the big data frame:
idx <- (SR1_data$subjNum == curr_subjNum & SR1_data$session == curr_session)
stopifnot(sum(idx) == PAR$N_trials_per_session)
## Bookkeeping:
SR1_data$lcType[idx] <- curr_lcType
SR1_data$exp_tau[idx] <- model_fits$exp_tau[[k]]
SR1_data$delta_AIC[idx] <- model_fits$delta_AIC[[k]]
## Extract the relevant data:
D <- data.frame(t = SR1_data$trialNum[idx], RT = SR1_data$clippedRT[idx])
stopifnot(nrow(D) == PAR$N_trials_per_session)
## Branch depending on model type
if (curr_lcType == "lin") {
# Fit a linear model:
linM <- lm(RT ~ t, D) # an object of class "lm"
# Store the fitted values in the main data frame:
SR1_data$RT_lcurve[idx] <- fitted(linM) # RELIES ON IMPLICIT ORDER
# Identify the model_RSE for later sanity check:
curr_model_RSE <- model_fits$lin_RSE[[k]] # cf sigma(linM)
} else {
stopifnot(curr_lcType == "exp")
# Make a function that implements the individualized exponential model:
indivF <- indiv_exp_model(k)
# Apply this function to the data and store the fitted values:
SR1_data$RT_lcurve[idx] <- indivF(D$t)
# Identify the model_RSE for later sanity check:
curr_model_RSE <- model_fits$exp_RSE[[k]]
} # if curr_lcType
## It is time to reap the fruits of all these efforts -- calculate residRT:
SR1_data$residRT[idx] <- SR1_data$clippedRT[idx] - SR1_data$RT_lcurve[idx]
## Calculate the residual standard error (RSE) and store it in the data frame:
SR1_data$lcRSE[idx] <- sd(SR1_data$residRT[idx])
# Sanity check: lcRSE should not be too far off curr_model_RSE
curr_model_RSE <- curr_model_RSE * sqrt(model_fits$mb_sz[[k]])
cat(sprintf(
"k=%2d, subj=%s, session=%5s, type=%s, lcRSE = %7.3f, modelRSE = %7.3f \n",
k, curr_subjNum, curr_session, curr_lcType,
constant(SR1_data$lcRSE[idx], verifyp = FALSE),
curr_model_RSE))
} # for k
## k= 1, subj=10889, session=video, type=lin, lcRSE = 143.271, modelRSE = 167.020
## k= 2, subj=10889, session= ling, type=lin, lcRSE = 58.970, modelRSE = 63.675
## k= 3, subj=19616, session=video, type=lin, lcRSE = 170.608, modelRSE = 182.513
## k= 4, subj=19616, session= ling, type=lin, lcRSE = 78.907, modelRSE = 82.413
## k= 5, subj=30474, session=video, type=exp, lcRSE = 131.403, modelRSE = 155.921
## k= 6, subj=30474, session= ling, type=exp, lcRSE = 86.592, modelRSE = 86.705
## k= 7, subj=30723, session=video, type=exp, lcRSE = 151.232, modelRSE = 161.251
## k= 8, subj=30723, session= ling, type=lin, lcRSE = 82.618, modelRSE = 98.450
## k= 9, subj=46991, session=video, type=exp, lcRSE = 183.007, modelRSE = 189.349
## k=10, subj=46991, session= ling, type=lin, lcRSE = 95.207, modelRSE = 105.815
## k=11, subj=52502, session=video, type=exp, lcRSE = 150.765, modelRSE = 182.379
## k=12, subj=52502, session= ling, type=lin, lcRSE = 174.309, modelRSE = 194.418
## k=13, subj=53927, session=video, type=exp, lcRSE = 112.004, modelRSE = 116.095
## k=14, subj=53927, session= ling, type=lin, lcRSE = 53.342, modelRSE = 49.396
## k=15, subj=55464, session=video, type=exp, lcRSE = 159.731, modelRSE = 178.119
## k=16, subj=55464, session= ling, type=lin, lcRSE = 180.733, modelRSE = 187.157
## k=17, subj=68030, session=video, type=lin, lcRSE = 158.757, modelRSE = 157.273
## k=18, subj=68030, session= ling, type=exp, lcRSE = 75.488, modelRSE = 77.667
## k=19, subj=69393, session=video, type=exp, lcRSE = 192.614, modelRSE = 192.866
## k=20, subj=69393, session= ling, type=lin, lcRSE = 204.555, modelRSE = 204.689
## k=21, subj=71440, session=video, type=exp, lcRSE = 143.482, modelRSE = 152.119
## k=22, subj=71440, session= ling, type=lin, lcRSE = 94.287, modelRSE = 98.108
## k=23, subj=79802, session=video, type=lin, lcRSE = 219.227, modelRSE = 257.525
## k=24, subj=79802, session= ling, type=lin, lcRSE = 203.235, modelRSE = 232.498
## k=25, subj=80116, session=video, type=exp, lcRSE = 167.282, modelRSE = 178.062
## k=26, subj=80116, session= ling, type=lin, lcRSE = 160.596, modelRSE = 215.719
## k=27, subj=87949, session=video, type=exp, lcRSE = 141.772, modelRSE = 152.277
## k=28, subj=87949, session= ling, type=exp, lcRSE = 95.824, modelRSE = 109.201
## k=29, subj=98309, session=video, type=lin, lcRSE = 245.119, modelRSE = 282.750
## k=30, subj=98309, session= ling, type=lin, lcRSE = 108.740, modelRSE = 124.441
## Sanity checks:
stopifnot(all(!is.na(SR1_data$lcType)))
stopifnot(all(!is.na(SR1_data$lcRSE)))
stopifnot(all(!is.na(SR1_data$RT_lcurve)))
stopifnot(all(!is.na(SR1_data$residRT)))
## Convert SR1_data$lcType to factor:
SR1_data$lcType <- factor(SR1_data$lcType)
## Set undefined exp_tau values to 9999.9. In all cases when exp_tau is not
# defined, we used the linear model by default. Theoretically, the linear
# model is a special case of the exponential one when the time constant
# tau gets very large. 9999.9 is two orders of magnitude larger than any
# of the valid exp_tau estimates:
idx <- is.na(SR1_data$exp_tau)
SR1_data$exp_tau[idx] <- 9999.9
stopifnot(all(!is.na(SR1_data$exp_tau)))
SR1_data$delta_AIC[idx] <- -9999.9 # negative delta_AIC values favor the linear model
stopifnot(all(!is.na(SR1_data$delta_AIC)))
## Descriptive statistics:
summary(SR1_data$residRT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -590.5877 -60.3483 -21.6858 0.0699 18.9305 1635.7277
## Remove temporary variables from the workspace:
rm("k", "curr_subjNum", "curr_session", "curr_lcType", "curr_model_RSE",
"idx", "D", "linM", "indivF")
Re-create the individual scatter plots from an earlier section, but this time superimpose the corresponding learning-curve fits:
## Re-create SR1_indivRT_Mblk, this time adding a column for RT_lcurve:
# (See the R cell named "scatter_by_sbj1" around line 430 above.)
SR1_indivRT_Mblk <- NULL # remove the old data frame to prevent interference
SR1_indivRT_Mblk <- ddply(SR1_data, .(subjNum, session, trialMblk), function(D1) {
data.frame(
RT_mean = mean(D1$clippedRT),
RT_lcurve = mean(D1$RT_lcurve), # <-- new variable
N_obs = nrow(D1) # number of observations
)})
## Sanity check:
stopifnot(constant(SR1_indivRT_Mblk$N_obs) == PAR$miniblock_size1)
## Alternative time variable -- useful for plotting sessions side-by-side:
SR1_indivRT_Mblk$trialMblk2 <- (with (SR1_indivRT_Mblk,
convert_trialNum_to_trialNum2(trialMblk, session)))
## Scatter-plot the miniblock-averaged RTs separately for each individual:
gp$RT_indiv_scatter2 <- ggplot(SR1_indivRT_Mblk,
aes(x = trialMblk2, y = RT_mean, shape=session)) +
geom_point(color="#AAAAAA") + facet_wrap(vars(subjNum)) +
geom_line(aes(y = RT_lcurve, color=session), linewidth=1)
print(gp$RT_indiv_scatter2)
## Plot the residuals:
SR1_indivRT_Mblk$residRT <- SR1_indivRT_Mblk$RT_mean - SR1_indivRT_Mblk$RT_lcurve
gp$RT_indiv_residuals2 <- ggplot(SR1_indivRT_Mblk,
aes(x = trialMblk2, y = residRT, shape=session, color=session)) +
geom_point() + facet_wrap(vars(subjNum))
print(gp$RT_indiv_residuals2)
These plots demonstrate that our learning-curve fits do pass through the middle of their respective clouds of data points.
Note that in 4 of the 30 cases the learning curves go slightly upward. I’m not sure whether this is a cause for concern or not. Apparently, in addition to the learning effects there are also fatigue effects, particularly on Day 2.
Let’s replot the group-averaged figure too. It is a candidate for a figure in the CogSci paper.
## Re-create SR1_groupRT_data, this time adding a column for RT_lcurve:
# (See the R cell named "group_RT_data1" around line 130 above.)
SR1_groupRT_data <- NULL # remove the old data frame to prevent interference
SR1_groupRT_data <- ddply(SR1_data, .(session, trialNum), function(D1) {
data.frame(
RT_mean = mean(D1$clippedRT),
RT_lcurve = mean(D1$RT_lcurve), # <-- new variable
N_sbj = nrow(D1)
)})
stopifnot(is.constant(SR1_groupRT_data$N_sbj))
## Define new time variables: (This just repeats what we did earlier.)
SR1_groupRT_data$trialNum2 <- (with (SR1_groupRT_data,
convert_trialNum_to_trialNum2(trialNum, session)))
SR1_groupRT_data$trialMblk <- convert_trialNum_to_trialMblk(SR1_groupRT_data$trialNum)
SR1_groupRT_data$trialMblk2 <- (with (SR1_groupRT_data,
convert_trialNum_to_trialNum2(trialMblk, session)))
## Re-create SR1_groupRT_Mblk, this time adding a column for RT_lcurve:
# (See the R cell named "groupRT_Mblk1" around line 180 above.)
SR1_groupRT_Mblk <- NULL # remove the old data frame to prevent interference
SR1_groupRT_Mblk <- ddply(SR1_groupRT_data, .(session, trialMblk), function(D1) {
data.frame(
trialMblk2 = constant(D1$trialMblk2),
RT_mean = mean(D1$RT_mean),
RT_lcurve = mean(D1$RT_lcurve), # <-- new variable
N_obs = nrow(D1) * constant(D1$N_sbj, verifyp=FALSE) # number of observations
)})
## Sanity check: The number of observations in each miniblock should be
# 120 = 15 subjects * 8 trials/miniblock = const
stopifnot(constant(SR1_groupRT_Mblk$N_obs) == PAR$N_sbj * PAR$miniblock_size)
## Scatter-plot the group-averaged RTs:
gp$RT_mean_scatter4 <- ggplot(SR1_groupRT_Mblk,
aes(x = trialMblk2, y = RT_mean, shape=session)) +
geom_point(color="#AAAAAA") +
geom_line(aes(y = RT_lcurve, color=session), linewidth=1)
print(gp$RT_mean_scatter4)
## Plot the residuals:
SR1_groupRT_Mblk$residRT <- SR1_groupRT_Mblk$RT_mean - SR1_groupRT_Mblk$RT_lcurve
gp$RT_group_residuals2 <- ggplot(SR1_groupRT_Mblk,
aes(x = trialMblk2, y = residRT, shape=session, color=session)) +
geom_point()
print(gp$RT_group_residuals2)
“./SR1_get_data2.Rmd” defined two sets of trial-level exclusion criteria. They are flagged in SR1_data$filter1 and SR1_data$filter2.
The first set of criteria (filter1) is uncontroversial. It is applied in the next cell. The second set, however, involves the z-transformed RTs based on the raw RT data. It is useful to re-calculate these z-scores now that we are using residual RTs instead of raw RTs.
The first filter is applied here:
## Apply the FIRST (but not the SECOND) set of trial-level exclusion criteria:
D <- subset(SR1_data, filter1) # Note: filter1 rather than filter2
## Sanity checks:
stopifnot(D$timeout==0) # trials with RT>2000 have been excluded
stopifnot(D$acc==1) # trials with incorrect responses have been excluded
stopifnot(D$filter1)
## Remove columns that have constant values:
D <- D[, setdiff(names(D), c("timeout", "acc", "filter1"))]
## Print a message on the console:
cat(sprintf("Extracted a dataframe D with %d rows and %d columns.\n",
nrow(D), ncol(D)))
## Extracted a dataframe D with 22373 rows and 37 columns.
The second filter (filter2) is based exclusively on z-scores of the raw RT data. Trials with abs(zRT) > meta_data$zRT_threshold were excluded from the sample. “./SR1_get_data2.Rmd” used a threshold of 3.0, which eliminated 395 trials, leaving 21978 “good” ones.
Let’s apply the same general idea to the residual RT data. To that end, we replicate the processing from lines 420 to ~500 in “./SR1_get_data2.Rmd”:
## Calculate rRT_stats and store them in a dataframe:
rRT_stats <- ddply(D, .(subjNum, session), function(D1) {
data.frame(
subjNum = constant(D1$subjNum, verifyp=FALSE),
session = constant(D1$session, verifyp=FALSE),
rRT_mean = mean(D1$residRT), # rather than mean(D1$RT)
rRT_std = sd(D1$residRT) # rather than sd(D1$RT)
)})
print(rRT_stats, digits = 3) # in milliseconds from static-image onset
## subjNum session rRT_mean rRT_std
## 1 10889 video -1.3050 130.6
## 2 10889 ling 2.9122 58.3
## 3 19616 video -9.3284 136.4
## 4 19616 ling 0.3976 79.0
## 5 30474 video -3.3599 115.2
## 6 30474 ling 0.0552 86.6
## 7 30723 video -7.3819 102.5
## 8 30723 ling -1.4304 58.2
## 9 46991 video -9.8203 127.6
## 10 46991 ling 2.5577 93.6
## 11 52502 video 0.7535 132.2
## 12 52502 ling -13.5553 111.5
## 13 53927 video 1.8614 95.6
## 14 53927 ling 1.1825 53.4
## 15 55464 video -4.1095 144.5
## 16 55464 ling -15.5875 104.7
## 17 68030 video -3.0021 147.7
## 18 68030 ling -0.3650 74.9
## 19 69393 video -0.2688 147.8
## 20 69393 ling -10.0805 149.3
## 21 71440 video 2.7865 106.9
## 22 71440 ling -1.3914 73.5
## 23 79802 video -3.9431 172.3
## 24 79802 ling -6.9834 172.8
## 25 80116 video -5.2667 140.9
## 26 80116 ling -1.0383 154.2
## 27 87949 video -4.1867 123.6
## 28 87949 ling -0.0156 95.9
## 29 98309 video -14.1834 202.0
## 30 98309 ling -1.9475 91.4
The means are relatively small because we are averaging residuals around the main trend as captured by the individual learning curves.
Also, most means are negative. This is because filter1 eliminated disproportionately more trials with long RTs than with short ones.
We need this information at the trial level. In the language of relational databases, we need to perform a JOIN operation using the pair (subjNum, session) as the key to link the two tables. Conveniently, the plyr package implements this functionality. (We could also use merge() from base R, but it tends to scramble the order of the rows in the resulting dataframe.)
SR1_data <- join(SR1_data, rRT_stats, by = c("subjNum", "session"))
We calculate the studentized RT and store it as a new variable “zrRT” (z-transformed residual RT). We then define our third filter for excluding trials: we keep only the trials whose abs(zrRT) does not exceed meta_data$zRT_threshold.
The final filter is stored as variable “filter3” in the dataframe. It subsumes filter1 but not filter2.
## Add a new column to the dataframe: zrRT = studentize(residRT)
SR1_data$zrRT <- with(SR1_data, ((residRT - rRT_mean) / rRT_std))
## Add one final column: filter3
SR1_data$filter3 <- with(SR1_data, (filter1 & (abs(zrRT) <= meta_data$zRT_threshold)))
## Tabulate how many trials satisfy the three filters.
# filter2 subsumes filter1:
with(SR1_data, table(filter1, filter2, useNA="ifany"))
## filter2
## filter1 FALSE TRUE
## FALSE 667 0
## TRUE 395 21978
# filter3 also subsumes filter1:
with(SR1_data, table(filter1, filter3, useNA="ifany"))
## filter3
## filter1 FALSE TRUE
## FALSE 667 0
## TRUE 400 21973
# Importantly, filter2 and filter3 are strongly but not perfectly correlated:
with(SR1_data, table(filter2, filter3, useNA="ifany"))
## filter3
## filter2 FALSE TRUE
## FALSE 1027 35
## TRUE 40 21938
## Sanity check: The filtered zrRT distribution should lie entirely in [-3.0, +3.0]
summary(subset(SR1_data, filter3)$zrRT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.87300 -0.49651 -0.18535 -0.09161 0.17990 2.99125
Very interestingly, there are 40 cases that are passed by filter2 but blocked by filter3. These are points that are too far away from the learning curves while simultaneously “hiding” in the bulk of the raw-RT distribution.
Conversely, there are 35 cases that are passed by filter3 but blocked by filter2. These are points, typically early in the video session, with large raw RTs that are accounted for by the elevated early portion of the learning curves.
We strongly recommend deprecating filter2 and using filter3 instead. Thus, as of May 2024, the recommended method for retrieving the “good” data is the following code:
## Recommended method for accessing the "good" trials for substantive analyses:
D <- subset(SR1_data, filter3)
Save the variables “SR1_data”, “SR1_sbj”, “meta_data”, and “model_fits” as an RData file. Note that this supersedes the similar RData file produced by “./SR1_get_data2.Rmd”.
## Remove the variable "trialMblk" from the main data frame. It is easy to
# recompute whenever it's needed, with the added flexibility of choosing
# the miniblock size.
SR1_data$trialMblk <- NULL
## Copy PAR$ref_trials to meta_data. It is an arbitrary choice related to the
# exponential-curve fits. (This choice shouldn't matter, but just in case.)
meta_data$exp_ref_trials <- PAR$ref_trials
## Pathnames
meta_data$main_data2.RData <- "SR1_main_data2.RData"
full_filename <- file.path(meta_data$SeeRel1_root, meta_data$data_path,
meta_data$main_data2.RData)
## Save the data to disk. Print a message on the console:
save(list = c("SR1_data", "SR1_sbj", "meta_data", "model_fits"),
file = full_filename)
cat(sprintf("Saved a dataframe with %d rows and %d columns in %s.\n",
nrow(SR1_data), ncol(SR1_data), full_filename))
## Saved a dataframe with 23040 rows and 43 columns in /Users/petrov.11/work/RelExper/SeeRel1/data/SR1_main_data2.RData.
## Tidy up the workspace
rm("full_filename")
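A minimal sketch of how a downstream script might reload the augmented data (adjust the hard-coded path for your machine; the recommended subset follows the advice above):
load("/Users/petrov.11/work/RelExper/SeeRel1/data/SR1_main_data2.RData", verbose = TRUE)
D <- subset(SR1_data, filter3)   # the recommended "good" trials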
Where to go from here: First, check out “./SR1_exploratory2.Rmd”. The publication-quality analyses and figures will be produced by “./SR1_CogSci2024_final.Rmd”.
End of script “SR1_learning_effect1.Rmd”.