Executive Summary

This script examines the learning effects in the “SeeRel1” (or “SR1”) experiment. It also adds 11 new columns to the data frame with the main data as documented at the end of this preamble. (See earlier scripts “./SR1_exploratory1.Rmd” and “./SR1_CogSci2024.Rmd” for more information about the experimental design, working hypotheses, etc.)

We submitted a six-page report of the SR1 experiment to the Cognitive Science 2024 conference. (See “…papers/conferences/2024/CogSci2024/submit1/”). All four reviewers pointed out learning effects as the major methodological concern that undermines the interpretability of our results. Indeed, a major flaw of our design is that the video session always occurs on Day 1 and the linguistic session on Day 2. This makes it hard to interpret the lack of priming effects in the linguistic condition. A deflationary alternative interpretation is that it is a null result due to over-training.

To remedy this problem, this script estimates the effects of practice in an attempt to partial them out statistically and thereby improve the interpretability of the main analysis. The main idea is to conduct the substantive analyses not on the raw RT data but on the residuals of a non-linear regression of RT on training time. The present script calculates these residuals and stores them in the data frame under the name “DRT” (short for “detrended response time”). A follow-up script (“./SR1_exploratory2.Rmd”) will analyze these DRTs as a function of relational role assignment, etc.

The present script loads a dataframe called SR1_data generated by “./SR1_get_data2.Rmd”. Data from the 15 participants who pass the exclusion criteria are concatenated together. Each row in the dataframe represents one experimental trial. The variables (columns) are documented in “./SR1_get_data2.Rmd”.

The present script creates a new RData file that contains an augmented version of the same dataframe. The first 32 columns in the augmented dataframe are identical with the columns in the original data set that was loaded from “../data/SR1_main_data.RData”.
Here we define the following 11 new columns:
c("clippedRT", "lcType", "lcRSE", "exp_tau", "delta_AIC", "RT_lcurve", "residRT", "rRT_mean", "rRT_std", "zrRT", "filter3")

For documentation of variables 1 through 32, see the preamble of “./SR1_get_data2.Rmd”. The new variables are as follows:

  1. clippedRT: RT clipped between 200 and 2000 ms (see meta_data$RT_limits)
  2. lcType: Learning-curve type -- factor with 2 levels: "lin" and "exp"
  3. lcRSE: Residual standard error of the learning-curve fit, separately for each participant in each session
  4. exp_tau: When lcType=="exp", the time-constant parameter of the fit
  5. delta_AIC: Difference AIC(linear model) - AIC(exponential model), used for selecting between the two learning-curve models. Positive values favor the exponential model.
  6. RT_lcurve: Best-fitting (in the least-squares sense) learning curve for a given participant in a given session
  7. residRT: clippedRT - RT_lcurve. Use this as the primary dependent variable for all subsequent RT analyses. It mitigates the confound that the linguistic session was always on Day 2.
  8. rRT_mean: Mean of residRT across the trials that satisfy filter1, separately for each subject in each session
  9. rRT_std: Standard deviation of residRT -- complement to rRT_mean above
  10. zrRT: Studentized residRT -- that is, zrRT = (residRT - rRT_mean) / rRT_std
  11. filter3: Whether the trial satisfies filter1 and also has abs(zrRT) <= 3.0. Use filter3 instead of filter2.

This RData file is loaded by the scripts that perform the substantive analyses in May 2024. We recommend that most substantive analyses be performed on the subset of SR1_data for which filter3==TRUE. (Note: filter THREE, not filter TWO.) To access the "good" data, use the following code:

D <- subset(SR1_data, filter3)

Where to go from here: First, check out "./SR1_exploratory2.Rmd". The publication-quality analyses and figures will be produced by "./SR1_CogSci2024_final.Rmd".


Getting started

To run this script on your computer, you may have to change the “trial_data_path” below. The file.choose() function will be your friend in this regard.

## Loading objects:
##   SR1_data
##   SR1_sbj
##   meta_data
## descriptor1: Raw data of SeeRel1 experiment. Collected in May-June 2023 at OSU. 
## descriptor2: Imported in R using SR1_get_data1.Rmd on Fri Jul 14 20:50:22 2023. 
## SeeRel1_root: /Users/petrov.11/work/RelExper/SeeRel1 
## data_path: data 
## raw_data.csv: SeeRelation_allData_formatted.csv 
## trial_data.RData: SR1_trial_data.RData 
## analysis_path: analysis 
## descriptor3: Dataframe SR1_trials loaded by SR1_get_data2.Rmd on Mon Jul 24 19:08:12 2023. 
## trial_data_path: /Users/petrov.11/work/RelExper/SeeRel1/data/SR1_main_data.RData 
## dropped_subj: c("16832", "18364", "24965", "29364", "29725", "38106") 
## zRT_threshold: 3 
## main_data.RData: SR1_main_data.RData 
## descriptor4: Dataframe SR1_data[23040x32] loaded by SR1_learning_effect1.Rmd on Thu May  9 14:57:22 2024.

Orientation and Preliminary Exploration

The data frame “SR1_data” contains data for the 15 participants who passed certain inclusion criteria. (See “./SR1_get_data2.Rmd”). Each participant performed two sessions of 768 trials each. (Zero entries correspond to participants who were excluded from the sample.)

##        subjNum
## session 10889 16832 18364 19616 24965 29364 29725 30474 30723 38106 46991 52502
##   video   768     0     0   768     0     0     0   768   768     0   768   768
##   ling    768     0     0   768     0     0     0   768   768     0   768   768
##        subjNum
## session 53927 55464 68030 69393 71440 79802 80116 87949 98309
##   video   768   768   768   768   768   768   768   768   768
##   ling    768   768   768   768   768   768   768   768   768

One trial-level exclusion criterion is that the response time should be neither too fast nor too slow: 200 <= RT < 2000. (See SR1_data$filter1 calculated by “./SR1_get_data2.Rmd”.) It makes sense here to clip the raw RTs to this range. Only 60 raw RTs were > 2000 ms and only 48 raw RTs were < 200 ms.

## The clip limits are global parameters:
PAR$RT_limits <- c(200, 2000)
meta_data$RT_limits <- PAR$RT_limits   # so that they will be saved in the .RData file

## Enforce the limits:
SR1_data$clippedRT <- pmax(PAR$RT_limits[1], pmin(PAR$RT_limits[2], SR1_data$RT))
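
As a quick check on how much data the clipping affects (a sketch; the counts in the comments are the ones quoted above):

## Tally the raw RTs outside the clip limits:
sum(SR1_data$RT > PAR$RT_limits[2])    # 60 raw RTs were above 2000 ms
sum(SR1_data$RT < PAR$RT_limits[1])    # 48 raw RTs were below 200 ms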

As a very crude initial exploration, let’s just average the (clipped) RT data across the 15 individual participants. We’ll deal with individual differences later. We store the aggregated data in a data frame called “SR1_groupRT_data”:

## Average the RTs across the individual participants:
SR1_groupRT_data <- ddply(SR1_data, .(session, trialNum), function(D1) {
  data.frame(
    RT_mean  = mean(D1$clippedRT),
    N_sbj = nrow(D1)
)})

## Verify that there is an equal number of replications in each condition:
stopifnot(is.constant(SR1_groupRT_data$N_sbj))

## The number of participants in the sample is a global parameter:
PAR$N_sbj <- constant(SR1_groupRT_data$N_sbj)

## Scatter-plot the group-averaged RTs:
gp$RT_mean_scatter1 <- ggplot(SR1_groupRT_data, 
  aes(x = trialNum, y = RT_mean, group=session, color=session, shape=session)) + 
  geom_point()
print(gp$RT_mean_scatter1)

Plot the two sessions in chronological order rather than one underneath the other:

## The number of trials per session is a global parameter:
PAR$N_trials_per_session <- max(SR1_data$trialNum)    # 768

## Helper function:
convert_trialNum_to_trialNum2 <- function(trialNum, session)
  (trialNum + ifelse(session=="video", 0, PAR$N_trials_per_session))

## Define a new time variable:
SR1_groupRT_data$trialNum2 <- (with (SR1_groupRT_data,
  convert_trialNum_to_trialNum2(trialNum, session)))

## Scatter-plot 2: same data, new X axis
gp$RT_mean_scatter2 <- ggplot(SR1_groupRT_data, 
  aes(x = trialNum2, y = RT_mean, group=session, color=session, shape=session)) + 
  geom_point()
print(gp$RT_mean_scatter2)

Let’s smooth the data by averaging across mini-blocks of 8 consecutive trials:

## Global parameter:
#PAR$miniblock_size = 16
PAR$miniblock_size = 8

## Helper function:
convert_trialNum_to_trialMblk <- function(trialNum, miniblock_size = PAR$miniblock_size)
  (floor((trialNum-1) / miniblock_size) * miniblock_size + floor(miniblock_size/2))
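
For example, with the default miniblock_size of 8, trials 1-8 collapse to the point t = 4, trials 9-16 to t = 12, and so forth:

## Each miniblock is labeled by its (approximate) midpoint:
stopifnot(all(convert_trialNum_to_trialMblk(1:8)  == 4))
stopifnot(all(convert_trialNum_to_trialMblk(9:16) == 12))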

## Define a new variable trialMblk that collapses each miniblock into a single point:
SR1_groupRT_data$trialMblk <- convert_trialNum_to_trialMblk(SR1_groupRT_data$trialNum)

SR1_groupRT_data$trialMblk2 <- (with (SR1_groupRT_data,
  convert_trialNum_to_trialNum2(trialMblk, session)))

## Average the RTs within miniblocks:
SR1_groupRT_Mblk <- ddply(SR1_groupRT_data, .(session, trialMblk), function(D1) {
  data.frame(
    trialMblk2 = constant(D1$trialMblk2),
    RT_mean  = mean(D1$RT_mean),
    N_obs = nrow(D1) * constant(D1$N_sbj, verifyp=FALSE)   # number of observations
)})

## Sanity check: The number of observations in each miniblock should be 
#  120 = 15 subjects * 8 trials/miniblock = const
stopifnot(constant(SR1_groupRT_Mblk$N_obs) == PAR$N_sbj * PAR$miniblock_size)

## Scatter-plot the group-averaged RTs:
gp$RT_mean_scatter3 <- ggplot(SR1_groupRT_Mblk, 
  aes(x = trialMblk2, y = RT_mean, group=session, color=session, shape=session)) + 
  geom_point()
print(gp$RT_mean_scatter3)

nls Regression of the Group Data from the Video Session

We will model the learning curves as exponential functions (Heathcote, Brown, & Mewhort, 2000, Psychonomic Bulletin & Review, 7, 185-207). We parameterize these exponential functions following Equation 4 of Petrov & Hayes (2010, Journal of Vision, 10(14):11). Each learning curve is defined by three parameters: y192, y576, and tau. The first two parameters fix the mean RTs at two reference (or "pivot") trials -- trials 192 and 576 since the beginning of the session. These reference trials were chosen a priori as a function of session length. The estimated learning curves (as opposed to their parameters) should be relatively insensitive to the choice of pivot times. See "./expTransf/expTransf_example.Rmd" for a related example.

## Fix the reference trials a priori:
PAR$ref_trials <- c(1/4, 3/4) * PAR$N_trials_per_session   # c(192, 576)

## Helper function:
find_RT_at_reference_trials <- function(t, RT, ref_trials = PAR$ref_trials) {
  stopifnot(length(t) == length(RT))   # must be vectors of equal length
  
  # Find the RT at the first reference trial:
  #y192 <- RT[t == ref_trials[1]]
  y192 <- RT[which.min(abs(t - ref_trials[1]))]
  
  # Find the RT at the second reference trial:
  #y576 <- RT[t == ref_trials[2]]
  y576 <- RT[which.min(abs(t - ref_trials[2]))]
  
  # Pack as a list and return:
  list(y192 = y192, y576 = y576)
}

The nonlinear, decelerating shape of the learning curves is governed by the time-constant parameter tau. It is measured in units of time – that is, trials. Thus the ratio t/tau is dimensionless and can be passed as an argument to exp(). We specify the exact exponential equation by means of an object of class formula, which can be passed to nls() to estimate the parameters (Venables & Ripley, 2010, Modern Applied Statistics with S).

Note that the reference times are hardwired in the formula. Make sure to redefine the formula if you change PAR$ref_trials.

## Make the formula object. This one formula will be used for all model fits:
M$nls_formula <- RT ~ (y192 * (  exp(-t/tau) - exp(-576/tau)) + 
                       y576 * (exp(-192/tau) -   exp(-t/tau))) / 
                      (exp(-192/tau) - exp(-576/tau))

## Sanity check. Because the reference times are hardwired in the formula,
#  assert that they match the corresponding values in PAR$ref_trials:
stopifnot(PAR$ref_trials[1] == 192)
stopifnot(PAR$ref_trials[2] == 576)

## Initial guess for the time-constant parameter:
PAR$guess_tau = 150    # trials

The following function implements the exponential model directly:

## Evaluate the exponential learning curve at time(s) t for the given parameters:
exp_model <- function(t, y192, y576, tau,
                      t192 = PAR$ref_trials[1], t576 = PAR$ref_trials[2]) {
  # Calculate subexpressions:
  exp_t_tau    <- exp(   -t/tau)
  exp_t192_tau <- exp(-t192/tau)
  exp_t576_tau <- exp(-t576/tau)
  
  # Combine the subexpressions into the return value:
  ((y192 * (exp_t_tau - exp_t576_tau) + 
    y576 * (exp_t192_tau - exp_t_tau) ) /
   (exp_t192_tau - exp_t576_tau))
}
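
As a quick sanity check of this parameterization, the curve must pass through the pivot points (192, y192) and (576, y576) by construction. (The parameter values below are arbitrary illustrative numbers.)

## The curve interpolates the two pivot points exactly:
stopifnot(isTRUE(all.equal(exp_model(192, y192 = 500, y576 = 450, tau = 150), 500)))
stopifnot(isTRUE(all.equal(exp_model(576, y192 = 500, y576 = 450, tau = 150), 450)))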

Let’s fit this exponential learning model to the group data for the video session.

The code below uses the following template: For each analysis, the relevant input data is formatted as a data frame with 768 rows and stored in the variable D. The initial guess for the parameter vector is stored in the variable theta0. Then stats::nls() is called to perform the model fit using nonlinear least squares. The resulting nls model object is stored in M under a unique name.

## Extract the data and store it in D:
D <- with(subset(SR1_groupRT_data, session == "video"), 
          data.frame(session = session, t = trialNum, RT = RT_mean))
stopifnot(nrow(D) == PAR$N_trials_per_session)

## Construct an initial guess for the three parameters:
theta0 <- find_RT_at_reference_trials(D$t, D$RT)
theta0$tau <- PAR$guess_tau

## Call nls() to do the real work:
currM <- nls(M$nls_formula, D, theta0)
M$nls_group_video <- currM

## Print a summary of the results and store it as an object that can be unpacked:
(currS <- summary(currM))
## 
## Formula: RT ~ (y192 * (exp(-t/tau) - exp(-576/tau)) + y576 * (exp(-192/tau) - 
##     exp(-t/tau)))/(exp(-192/tau) - exp(-576/tau))
## 
## Parameters:
##      Estimate Std. Error t value Pr(>|t|)    
## y192  523.579      2.844   184.1   <2e-16 ***
## y576  479.549      2.175   220.5   <2e-16 ***
## tau   123.594      8.523    14.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.66 on 765 degrees of freedom
## 
## Number of iterations to convergence: 8 
## Achieved convergence tolerance: 8.263e-06

Last but not least, the parameter estimates (and their corresponding standard errors) are extracted using this helper function:

## A function that takes an nls model object and returns a one-row data frame:
extract_param_estimates <- function (currS) {
  data.frame(y192    = coef(currS)[["y192", "Estimate"]],
             y192_se = coef(currS)[["y192", "Std. Error"]],
             y576    = coef(currS)[["y576", "Estimate"]],
             y576_se = coef(currS)[["y576", "Std. Error"]], 
             tau     = coef(currS)[["tau",  "Estimate"]],
             tau_se  = coef(currS)[["tau",  "Std. Error"]],
             RSE     = currS$sigma,
             status  = currS$convInfo[["stopMessage"]])
}  # extract_param_estimates

## Extract parameter estimates and store them as a row in group_params:
group_params <- cbind(data.frame(model = "nls_group_video", 
                                 session = constant(D$session)),
                      extract_param_estimates(currS))
print(group_params, digits = 3)
##             model session y192 y192_se y576 y576_se tau tau_se  RSE    status
## 1 nls_group_video   video  524    2.84  480    2.18 124   8.52 43.7 converged

Add the fitted values and residuals to the data frame:

## Estimated learning curve:
D$RT_hat <- fitted(currM)

## Residuals (observed minus fitted):
D$RT_resid <- residuals(currM)

## The residuals should have mean==0.0 and std.dev==RSE*(n-1)/n:
summary(D$RT_resid)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -142.859  -28.111   -7.387    0.000   20.034  249.553
cat(sprintf("sd(RT_resid) = %.2f, RSE = %.2f.", sd(D$RT_resid), sigma(currM)))
## sd(RT_resid) = 43.60, RSE = 43.66.

Sanity check: Verify that the function exp_model() produces the same estimated learning curve (within rounding error):

## Calculate the fitted values directly:
D$RT_hat1 <- exp_model(D$t, group_params$y192, group_params$y576, group_params$tau)

## Sanity check:
stopifnot(sum((D$RT_hat - D$RT_hat1)^2) < 1e-10)  # allowance for rounding errors

Superimpose the fitted learning curve onto the raw scatterplot:

## Scatter-plot with a superimposed learning curve:
gp$RT_group_lcurve1 <- ggplot(D, 
  aes(x = t, y = RT, shape=session)) + 
  geom_point(color="#AAAAAA") +
  geom_line(aes(y = RT_hat, color=session), linewidth=1)
print(gp$RT_group_lcurve1)

## Plot the residuals:
gp$RT_group_residuals1 <- ggplot(D,
  aes(x = t, y = RT_resid, shape=session, color=session)) + 
  geom_point()
print(gp$RT_group_residuals1)

nls Regression of the Group Data from the Linguistic Session

Repeat the exercise for the group data from the linguistic session. It turns out that nls() does not converge on these data. This is because the data are scattered basically around a straight line, which cannot be fit well with an exponential function starting at trial=1.

## Extract the linguistic data and store it in D:
D <- with(subset(SR1_groupRT_data, session == "ling"), 
          data.frame(session = session, t = trialNum, RT = RT_mean))
stopifnot(nrow(D) == PAR$N_trials_per_session)

## Construct an initial guess for the three parameters:
theta0 <- find_RT_at_reference_trials(D$t, D$RT)
theta0$tau <- PAR$guess_tau

## Call nls() to do the real work.
#  Because it generates an error under the default settings, we must change
#  the control settings to issue a warning instead of a fatal error:
currM <- nls(M$nls_formula, D, theta0, control= list(warnOnly = TRUE))
## Warning in nls(M$nls_formula, D, theta0, control = list(warnOnly = TRUE)):
## number of iterations exceeded maximum of 50
M$nls_group_ling <- currM

## Print a summary of the results and store it as an object that can be unpacked:
(currS <- summary(currM))
## 
## Formula: RT ~ (y192 * (exp(-t/tau) - exp(-576/tau)) + y576 * (exp(-192/tau) - 
##     exp(-t/tau)))/(exp(-192/tau) - exp(-576/tau))
## 
## Parameters:
##      Estimate Std. Error t value Pr(>|t|)    
## y192  398.808      1.248 319.452   <2e-16 ***
## y576  398.808      1.248 319.431   <2e-16 ***
## tau    14.075      5.475   2.571   0.0103 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.3 on 765 degrees of freedom
## 
## Number of iterations till stop: 50 
## Achieved convergence tolerance: 0.1113
## Reason stopped: number of iterations exceeded maximum of 50
## Extract parameter estimates and add them as a new row in group_params:
temp <- cbind(data.frame(model = "nls_group_ling", 
                         session = constant(D$session)),
              extract_param_estimates(currS))
group_params <- rbind(group_params, temp)   # add temp as a new row at the bottom
print(group_params, digits = 3)
##             model session y192 y192_se y576 y576_se   tau tau_se  RSE
## 1 nls_group_video   video  524    2.84  480    2.18 123.6   8.52 43.7
## 2  nls_group_ling    ling  399    1.25  399    1.25  14.1   5.48 33.3
##                                        status
## 1                                   converged
## 2 number of iterations exceeded maximum of 50

It is probably a better idea to fit a simpler, linear model to the data from the linguistic session. Before we commit to this, however, let’s look at the raw data of the individual participants.

## Remove temporary variables from the workspace:
rm("temp", "D", "theta0", "currM", "currS")

Plot the Individual Learning Curves

Let's make RT scatterplots separately for each individual participant. To make the plots easier to "fit by eye", it is useful to smooth the data by averaging across miniblocks. We define a new parameter PAR$miniblock_size1 -- it is the individual-level counterpart of the group-level PAR$miniblock_size.

## Miniblocks at the individual level:
PAR$miniblock_size1 <- 16     # do not confuse with PAR$miniblock_size

## Define a new variable trialMblk that collapses each miniblock into a single point:
#  Note: This time we modify the main dataframe SR1_data instead of SR1_groupRT_data.
SR1_data$trialMblk <- 
    convert_trialNum_to_trialMblk(SR1_data$trialNum, PAR$miniblock_size1)

## Average the RTs within miniblocks:
SR1_indivRT_Mblk <- ddply(SR1_data, .(subjNum, session, trialMblk), function(D1) {
  data.frame(
    RT_mean  = mean(D1$clippedRT),
    N_obs = nrow(D1)    # number of observations
)})

## Sanity check:
stopifnot(constant(SR1_indivRT_Mblk$N_obs) == PAR$miniblock_size1)

## Alternative time variable -- useful for plotting sessions side-by-side:
SR1_indivRT_Mblk$trialMblk2 <- (with (SR1_indivRT_Mblk,
  convert_trialNum_to_trialNum2(trialMblk, session)))

## Scatter-plot the miniblock-averaged RTs separately for each individual:
gp$RT_indiv_scatter1 <- ggplot(SR1_indivRT_Mblk, 
  aes(x = trialMblk2, y = RT_mean, group=session, color=session, shape=session)) + 
  geom_point() + facet_wrap(vars(subjNum))
print(gp$RT_indiv_scatter1)

Individual Nonlinear Fits

Our ultimate goal is to remove the learning effects from the RT data by regressing away the learning curve separately for each participant in each session. It seems clear from eyeballing the individual scatterplots that we should consider two alternative models – exponential and linear.

Let's adopt the following strategy: We are going to fit both models to the clipped RT data of each individual participant in each session. If nls() fails to converge on a given data set with default control parameters, we decide that the exponential model is not viable for these data. If nls() does converge, then we do a model comparison to decide whether the exponential model (which has one extra parameter) or the simpler linear model should be preferred.

As will soon become clear, we need to smooth the raw data to facilitate convergence of the nls() algorithm. The smoothing is performed by averaging within miniblocks. The following helper function accomplishes this:

## Helper function that produces a data frame D with SMOOTHED data for 
#  a given participant in a given session:
extract_and_smooth_indiv_RTs <- function(which_subjNum, which_session, miniblock_size,
                                                       big_data = SR1_data) {
  # Extract the relevant subset of the big data frame. It has 768 rows:
  D768 <- subset(big_data, subjNum == which_subjNum & session == which_session)
  stopifnot(nrow(D768) == PAR$N_trials_per_session)   # 768
  
  # Re-calculate trialMblk with the specified miniblock_size:
  if (miniblock_size > 1) {
    D768$trialMblk <- convert_trialNum_to_trialMblk(D768$trialNum, miniblock_size)
  } else {
    stopifnot(miniblock_size == 1)
    # convert_trialNum_to_trialMblk doesn't handle this special case very well
    D768$trialMblk <- D768$trialNum   
  }

  # Average the clipped RTs within each miniblock:
  D <- ddply(D768, .(trialMblk), function(D1) {
         data.frame(
           subjNum = constant(D1$subjNum, verifyp = FALSE),
           session = constant(D1$session, verifyp = FALSE),
           t       = constant(D1$trialMblk, verifyp = FALSE),
           RT      = mean(D1$clippedRT),
           N_obs   = nrow(D1)
       )}) # end of ddply(...)
  
  # Sanity checks:
  stopifnot(nrow(D) == PAR$N_trials_per_session / miniblock_size)
  stopifnot(constant(D$N_obs) == miniblock_size)
  
  # Return the resulting data frame:
  D
} # extract_and_smooth_indiv_RTs
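
Typical usage (an illustrative call; any retained <subjNum, session> pair works):

## Example: video-session RTs of participant 10889, smoothed in miniblocks of 4.
#  The result has 768/4 == 192 rows, one per miniblock:
D <- extract_and_smooth_indiv_RTs("10889", "video", miniblock_size = 4)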

We store the residual standard error (RSE) and other results from these calculations in a new data frame called model_fits:

Fix a miniblock_size parameter and fit 30 linear models + 30 exponential models. There is a hyper-parameter PAR$miniblock_size_nls (aka mb_sz). We ran the script below multiple times for mb_sz in c(1, 2, 4, 6, 8). The results are summarized in the pre-formatted block of text below. The overall conclusion of these trial-and-error explorations is that the best value is PAR$miniblock_size_nls == 4. Accordingly, it is set as the default in the “production” code below.

I've added a few "magic" special cases that customize mb_sz for certain individual data sets. With these customizations, nls() converges for 11 of the 15 data sets in the video session and for 7 of the 15 data sets in the linguistic session.
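
The chunk that initializes model_fits is not echoed in this document. The loop below assumes one row per <subjNum, session> pair, with nls_isConv defaulting to FALSE and NA placeholders for the fit statistics. A minimal sketch of such an initialization -- the column set is inferred from the loop body and the details are hypothetical:

## Hypothetical initialization of model_fits (the actual chunk is not shown).
#  expand.grid() varies its first factor fastest, reproducing the row order
#  <10889,video>, <10889,ling>, <19616,video>, ... seen in the printouts below:
model_fits <- expand.grid(session = c("video", "ling"),
                          subjNum = sort(unique(as.character(SR1_data$subjNum))),
                          stringsAsFactors = FALSE)[, c("subjNum", "session")]
model_fits$mb_sz      <- NA_integer_
model_fits$nls_isConv <- FALSE
for (v in c("lin_RSE", "lin_intcpt", "lin_intcpt_se", "lin_slope", "lin_slope_se",
            "exp_RSE", "exp_y192", "exp_y192_se", "exp_y576", "exp_y576_se",
            "exp_tau", "exp_tau_se"))
  model_fits[[v]] <- NA_real_
rm("v")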

## Pick a value for PAR$miniblock_size_nls:
PAR$miniblock_size_nls <- 4

## For each <subj, session> pair, fit both models:
for (k in seq(1, nrow(model_fits))) {
  ## Determine customized smoothing parameter:
  curr_subjNum <- as.character(model_fits$subjNum[[k]])  # e.g., "10889"
  curr_session <- as.character(model_fits$session[[k]])  # "video" or "ling"
  curr_miniblock_size <- PAR$miniblock_size_nls  # UNLESS ...
  
  # ... UNLESS we are in one of the following "magic" cases.
  # The customized mb_sz values were determined by trial and error:
  if (curr_subjNum == "69393") { curr_miniblock_size <- 1  }  # for both sessions
  if (curr_subjNum == "80116" & curr_session == "video") {
    curr_miniblock_size <- 2  }
  if (curr_subjNum == "30474" & curr_session == "ling") {
    curr_miniblock_size <- 1  }
  if (curr_subjNum == "10889" & curr_session == "ling") {
    curr_miniblock_size <- 6  }
  
  # Store the customized smoothing parameter in the data frame:
  model_fits$mb_sz[[k]] <- curr_miniblock_size
  
  ## Extract the relevant data
  D <- extract_and_smooth_indiv_RTs(curr_subjNum, curr_session, curr_miniblock_size)

  ## Fit the linear model. It always converges
  linM <- lm(RT ~ t, D)   # an object of class "lm"
  linS <- summary(linM)   # an object of class "summary.lm"
  
  ## Store information about the linear fit in the data frame:
  model_fits$lin_RSE[[k]] <- sigma(linM)
  model_fits$lin_intcpt[[k]]    <- coef(linS)[["(Intercept)", "Estimate"]]
  model_fits$lin_intcpt_se[[k]] <- coef(linS)[["(Intercept)", "Std. Error"]]
  model_fits$lin_slope[[k]]     <- coef(linS)[["t", "Estimate"]]
  model_fits$lin_slope_se[[k]]  <- coef(linS)[["t", "Std. Error"]]
  
  ## Fit the exponential model. It may NOT converge!
  #  Construct an initial guess for the three parameters:
  theta0 <- find_RT_at_reference_trials(D$t, D$RT)
  theta0$tau <- PAR$guess_tau
  #  Call nls() and check for convergence.  expM is an object of class "nls"
  expM <- nls(M$nls_formula, D, theta0, control= list(warnOnly = TRUE))
  if (expM$convInfo$isConv) {
    # The exponential model converged -- store the parameter estimates:
    model_fits$nls_isConv[[k]] <- TRUE
    model_fits$exp_RSE[[k]] <- sigma(expM)
    
    expS <- summary(expM)   # an object of class "summary.nls"
    model_fits$exp_y192[[k]]    <- coef(expS)[["y192", "Estimate"]]
    model_fits$exp_y192_se[[k]] <- coef(expS)[["y192", "Std. Error"]]
    model_fits$exp_y576[[k]]    <- coef(expS)[["y576", "Estimate"]]
    model_fits$exp_y576_se[[k]] <- coef(expS)[["y576", "Std. Error"]]
    model_fits$exp_tau[[k]]     <- coef(expS)[["tau", "Estimate"]]
    model_fits$exp_tau_se[[k]]  <- coef(expS)[["tau", "Std. Error"]]
    
  } else {
    # The exponential model did not converge -- do nothing:
    # The data frame is initialized with correct values (mostly NAs).
    expS <- NULL  # erase any leftovers from the previous iteration, just in case
    
  } # end if (expM$convInfo$isConv)
  
} # for k
## Results of fitting the models with PAR$miniblock_size_nls == 4:
## 
##   In the video session, nls() converged for these 11 participants:
##    subjNum session mb_sz lin_RSE exp_RSE exp_tau
## 5    30474   video     4   82.94   77.96  111.11
## 7    30723   video     4   85.13   80.63   85.00
## 9    46991   video     4   96.78   94.67  210.19
## 11   52502   video     4   94.80   91.19  437.78
## 13   53927   video     4   65.10   58.05   24.48
## 15   55464   video     4  107.37   89.06   73.39
## 17   68030   video     4   78.64   78.42  300.24
## 19   69393   video     1  195.87  192.87   74.30
## 21   71440   video     4   98.72   76.06   31.69
## 25   80116   video     2  131.51  125.91  119.83
## 27   87949   video     4   90.71   76.14  190.94
## 
##   In the ling session, nls() converged for these 7 participants:
##    subjNum session mb_sz lin_RSE exp_RSE exp_tau
## 2    10889    ling     6   26.00   25.99  -42.88
## 6    30474    ling     1   86.99   86.70   60.86
## 18   68030    ling     4   41.37   38.83  218.74
## 20   69393    ling     1  204.69  204.81  907.50
## 22   71440    ling     4   49.05   49.13  349.48
## 26   80116    ling     4  107.86  107.88  508.99
## 28   87949    ling     4   56.34   54.60  128.90
## Remove temporary variables from the workspace:
rm("k", "curr_subjNum", "curr_session", "curr_miniblock_size", 
   "D", "theta0", "linM", "linS", "expM", "expS")
Experiments with various values of PAR$miniblock_size_nls produced the following results. These trial-and-error runs were used to determine the “magic” customized values for the miniblock_size smoothing parameter in the preceding code cell.
##################################################################
Results of fitting the models with PAR$miniblock_size_nls == 1:

In the video session, nls() converged for these 8 participants:
     subjNum session lin_RSE exp_RSE exp_tau
  5    30474   video     135     132   111.8
  9    46991   video     184     183   209.5
  11   52502   video     153     151   439.5
  15   55464   video     171     160    72.1
  17   68030   video     159     159   295.9
  19   69393   video     196     193    74.3   # <-- unstable
  25   80116   video     172     167   119.4
  27   87949   video     150     142   190.4

In the ling session, nls() converged for these 5 participants:
     subjNum session lin_RSE exp_RSE exp_tau
  6    30474    ling    87.0    86.7    60.9   # <-- unstable
  18   68030    ling    76.9    75.6   219.5
  20   69393    ling   204.7   204.8   907.5   # <-- unstable
  22   71440    ling    94.3    94.4   349.6
  28   87949    ling    97.0    95.9   129.6

##################################################################
Results of fitting the models with PAR$miniblock_size_nls == 2:

In the video session, nls() converged for these 8 participants:
     subjNum session lin_RSE exp_RSE exp_tau
  5    30474   video   106.2   102.4  112.03
  7    30723   video   125.4   122.4   83.09    # <-- new
  9    46991   video   137.0   135.5  209.68
  11   52502   video   116.6   113.6  439.81
  15   55464   video   135.2   121.1   71.62
  17   68030   video   109.7   109.5  297.61
  25   80116   video   131.5   125.9  119.83
  27   87949   video   116.0   105.0  190.89

In the ling session, nls() converged for these 5 participants:
     subjNum session lin_RSE exp_RSE exp_tau
  12   52502    ling  126.68  124.48  -22.89    # <-- new
  18   68030    ling   55.45   53.60  219.28
  22   71440    ling   70.16   70.21  350.28
  26   80116    ling  126.25  126.19  509.81    # <-- new
  28   87949    ling   71.18   69.80  130.18

##################################################################
Results of fitting the models with PAR$miniblock_size_nls == 4:

In the video session, nls() converged for these 10 participants:
     subjNum session lin_RSE exp_RSE exp_tau
  5    30474   video   82.94   77.96  111.11
  7    30723   video   85.13   80.63   85.00
  9    46991   video   96.78   94.67  210.19
  11   52502   video   94.80   91.19  437.78
  13   53927   video   65.10   58.05   24.48    # <-- new
  15   55464   video  107.37   89.06   73.39
  17   68030   video   78.64   78.42  300.24
  21   71440   video   98.72   76.06   31.69    # <-- new
  25   80116   video  113.93  125.91  119.83    # <-- unstable
  27   87949   video   90.71   76.14  190.94

In the ling session, nls() converged for these 5 participants:
     subjNum session lin_RSE exp_RSE exp_tau
  12   52502    ling   97.21  124.48  -22.89
  18   68030    ling   41.37   38.83  218.74
  22   71440    ling   49.05   49.13  349.48
  26   80116    ling  107.86  107.88  508.99
  28   87949    ling   56.34   54.60  128.90

##################################################################
Results of fitting the models with PAR$miniblock_size_nls == 6:

In the video session, nls() converged for these 8 participants:
     subjNum session mb_sz lin_RSE exp_RSE exp_tau
  7    30723   video     6   75.62   70.40   82.76
  9    46991   video     6   82.28   79.77  209.26
  11   52502   video     6   83.99   79.97  440.28
  15   55464   video     6   98.68   78.38   74.19
  17   68030   video     6   62.17   61.86  297.47
  19   69393   video     6   98.97   92.99   75.47
  25   80116   video     6   95.17   87.31  119.59
  27   87949   video     6   78.22   60.73  191.63

In the ling session, nls() converged for these 5 participants:
     subjNum session mb_sz lin_RSE exp_RSE exp_tau
  2    10889    ling     6   26.00   25.99  -42.88    # <-- new
  18   68030    ling     6   37.23   34.46  221.66
  22   71440    ling     6   39.12   39.21  344.23
  26   80116    ling     6   92.00   92.05  511.73
  28   87949    ling     6   46.22   44.03  126.67

##################################################################
Running the same script with PAR$miniblock_size_nls == 8
produced a fatal error in one of the nls() attempts.

##################################################################

*******  Final Results of the Trial-and-Error Runs *******
Running the same script with CUSTOMIZED miniblock_size produced
the following results, 2024-05-08:

In the video session, nls() converged for these 11 participants:
   subjNum session mb_sz lin_RSE exp_RSE exp_tau
  5    30474   video     4   82.94   77.96  111.11
  7    30723   video     4   85.13   80.63   85.00
  9    46991   video     4   96.78   94.67  210.19
  11   52502   video     4   94.80   91.19  437.78
  13   53927   video     4   65.10   58.05   24.48
  15   55464   video     4  107.37   89.06   73.39
  17   68030   video     4   78.64   78.42  300.24  # will be rejected by AIC
  19   69393   video     1  195.87  192.87   74.30
  21   71440   video     4   98.72   76.06   31.69
  25   80116   video     2  131.51  125.91  119.83
  27   87949   video     4   90.71   76.14  190.94

In the ling session, nls() converged for these 7 participants:
     subjNum session mb_sz lin_RSE exp_RSE exp_tau
  2    10889    ling     6   26.00   25.99  -42.88  # negative tau makes no sense
  6    30474    ling     1   86.99   86.70   60.86
  18   68030    ling     4   41.37   38.83  218.74
  20   69393    ling     1  204.69  204.81  907.50  # exp_RSE > lin_RSE
  22   71440    ling     4   49.05   49.13  349.48  # exp_RSE > lin_RSE
  26   80116    ling     4  107.86  107.88  508.99  # exp_RSE > lin_RSE
  28   87949    ling     4   56.34   54.60  128.90

##################################################################

Model Comparison

We are going to use the Akaike information criterion (AIC) to select between the two competing models in each case. Theoretically, the exponential model subsumes the linear model as a special case when the time constant tau grows very large. Thus, the exponential model should always achieve a better fit. This is not guaranteed in practice because of the vagaries of the iterative optimization algorithm within nls(). But if we assume that nls() did minimize the residual error (RSE), then it should always be the case that exp_RSE <= lin_RSE. The question is whether this drop can be attributed to overfitting due to the extra free parameter exp_tau.

For a model fit by least squares, the AIC can be calculated by the formula: AIC = 2*k + n*log(RSE^2) = 2*k + 2*n*log(RSE), where k is the number of free parameters and n is the number of data points used to fit the model. (Additive constants shared by both models cancel when taking the difference.)

Reference: Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference: A practical information-theoretic approach (2nd ed.), p. 63, cited in https://en.wikipedia.org/wiki/Akaike_information_criterion.

The model with minimum AIC is preferred. The following helper function calculates the difference AIC(lin_model) - AIC(exp_model). Thus, positive values favor the exponential model:

## Akaike information criterion:
delta_AIC <- function(lin_RSE, exp_RSE, N_obs) {
  # The linear model has 2 parameters:
  AIC_lin <- 2*2 + 2*N_obs*log(lin_RSE)
  
  # The exponential model has 3 parameters:
  AIC_exp <- 2*3 + 2*N_obs*log(exp_RSE)
  
  # Return the difference. Positive values favor the exponential model
  AIC_lin - AIC_exp
}
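
A worked example, using the fits tabulated above for participant 68030 in the video session (lin_RSE = 78.64, exp_RSE = 78.42, fit on n = 768/4 = 192 miniblock means):

## delta_AIC = (2*2 + 2*192*log(78.64)) - (2*3 + 2*192*log(78.42))
#            = 384*log(78.64/78.42) - 2  ~  -0.92
delta_AIC(78.64, 78.42, 192)   # negative, so AIC prefers the linear model here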

We are set to perform the actual model comparisons:

## Add a new column -- lcType, short for "learning-curve model type":
model_fits$lcType <- NA

## The linear model wins by default whenever nls() failed to converge:
idx <- !model_fits$nls_isConv     # a vector of logical values
model_fits$lcType[idx] <- "lin"

## The linear model also wins when the time-constant estimate is negative:
idx <- (model_fits$nls_isConv & (model_fits$exp_tau <= 0))
model_fits$lcType[idx] <- "lin"

## The linear model also wins whenever lin_RSE <= exp_RSE:
#  (This happens when nls() failed to find the true optimal solution for
#  the non-linear model.)
idx <- (model_fits$nls_isConv & (model_fits$lin_RSE <= model_fits$exp_RSE))
model_fits$lcType[idx] <- "lin"

## At this point, only three data sets are left standing in the linguistic
## condition -- indices 6, 18, and 28 -- subjNum 30474, 68030, and 87949.
## In the video condition, all 11 cases for which nls() converged are still
## standing.  We appeal to AIC to resolve these non-trivial cases:

## Add yet another column to model_fits:
model_fits$delta_AIC <- NA
idx <- model_fits$nls_isConv
model_fits$delta_AIC[idx] <- delta_AIC(model_fits$lin_RSE[idx],
                                       model_fits$exp_RSE[idx],
                PAR$N_trials_per_session/model_fits$mb_sz[idx])

## Positive values of delta_AIC favor the exponential model:
idx <- is.na(model_fits$lcType)   # the cases that are still not resolved
model_fits$lcType[idx] <- ifelse(model_fits$delta_AIC[idx] > 0,
                                 "exp",    # AIC favors the exponential model
                                 "lin")    # AIC favors the linear model

## The AIC criterion favored "lin" in only one of the contested cases --
## that for participant 68030 in the video condition (idx==17).
## After all said and done, 10 of the 15 cases in the video condition have
## lcType=="exp", as contrasted to only 3 cases in the linguistic condition.
## This is reflected in the following table:
with(model_fits, table(session, lcType))
##        lcType
## session exp lin
##   video  10   5
##   ling    3  12

Estimate the Individual Learning Curves

The following function is a wrapper around exp_model() defined earlier. It takes a single argument – an integer row-index into model_fits. It returns a function of a single argument, t, that can be applied to data.

## Take an integer index into model_fits and return a function of t:
indiv_exp_model <- function(k, params = model_fits) {
  ## Retrieve the relevant parameters:
  stopifnot(params$nls_isConv[[k]])   # sanity check
  y192 <- params$exp_y192[[k]]
  y576 <- params$exp_y576[[k]]
  tau  <- params$exp_tau[[k]]
  
  ## Construct and return a function of a single argument:
  function(t)
    exp_model(t, y192, y576, tau)
} # indiv_exp_model
## Test case illustrating typical usage:
k <- 5 
D <- subset(SR1_data, subjNum == model_fits$subjNum[[k]] & session == model_fits$session[[k]])
indivF <- indiv_exp_model(k)
D$RT_lcurve <- indivF(D$trialNum)

## Plot the fit:
print(ggplot(D, aes(x = trialNum, y = clippedRT)) + geom_point(color="#AAAAAA") +
        geom_line(aes(y = RT_lcurve), color="red"))

Finally we are in a position to estimate the learning curve for each individual data set.

## We'll add a few columns to the big data frame, to be populated in the loop below:
SR1_data$lcType    <- NA
SR1_data$lcRSE     <- NA
SR1_data$exp_tau   <- NA
SR1_data$delta_AIC <- NA
SR1_data$RT_lcurve <- NA  # Sic!
SR1_data$residRT  <- NA  # Sic!

## For each <subj, session> pair, calculate RT_lcurve from the appropriate model:
for (k in seq(1, nrow(model_fits))) {
  ## What are we dealing with?
  curr_subjNum <- as.character(model_fits$subjNum[[k]])  # e.g., "10889"
  curr_session <- as.character(model_fits$session[[k]])  # "video" or "ling"
  curr_lcType  <- model_fits$lcType[[k]]                 # "lin" or "exp"
  
  ## A logical index to the relevant rows in the big data frame:
  idx <- (SR1_data$subjNum == curr_subjNum & SR1_data$session == curr_session)
  stopifnot(sum(idx) == PAR$N_trials_per_session)
  
  ## Bookkeeping:
  SR1_data$lcType[idx] <- curr_lcType
  SR1_data$exp_tau[idx] <- model_fits$exp_tau[[k]]
  SR1_data$delta_AIC[idx] <- model_fits$delta_AIC[[k]]
  
  ## Extract the relevant data:
  D <- data.frame(t = SR1_data$trialNum[idx], RT = SR1_data$clippedRT[idx])
  stopifnot(nrow(D) == PAR$N_trials_per_session)
  
  ## Branch depending on model type
  if (curr_lcType == "lin") {
    # Fit a linear model:
    linM <- lm(RT ~ t, D)   # an object of class "lm"
    
    # Store the fitted values in the main data frame:
    SR1_data$RT_lcurve[idx] <- fitted(linM)  # RELIES ON IMPLICIT ORDER
    
    # Identify the model_RSE for later sanity check:
    curr_model_RSE <- model_fits$lin_RSE[[k]]  # cf sigma(linM)

  } else {
    stopifnot(curr_lcType == "exp")
    
    # Make a function that implements the individualized exponential model:
    indivF <- indiv_exp_model(k)
    
    # Apply this function to the data and store the fitted values:
    SR1_data$RT_lcurve[idx] <- indivF(D$t)
    
    # Identify the model_RSE for later sanity check:
    curr_model_RSE <- model_fits$exp_RSE[[k]]
    
  } # if curr_lcType

  ## It is time to reap the fruits of all these efforts -- calculate residRT:
  SR1_data$residRT[idx] <- SR1_data$clippedRT[idx] - SR1_data$RT_lcurve[idx]
  
  ## Calculate the residual standard error (RSE) and store it in the data frame:
  SR1_data$lcRSE[idx] <- sd(SR1_data$residRT[idx])
  
  # Sanity check: lcRSE should not be too far off curr_model_RSE
  curr_model_RSE <- curr_model_RSE * sqrt(model_fits$mb_sz[[k]])
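  #  (The models in model_fits were fit to miniblock means. Averaging m trials
  #  shrinks the residual SD by a factor of sqrt(m), so the miniblock-level RSE
  #  is scaled by sqrt(mb_sz) to make it comparable to the trial-level lcRSE.)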
  cat(sprintf(
    "k=%2d, subj=%s, session=%5s, type=%s, lcRSE = %7.3f, modelRSE = %7.3f \n",
              k, curr_subjNum, curr_session, curr_lcType,
              constant(SR1_data$lcRSE[idx], verifyp = FALSE),
              curr_model_RSE))

} # for k
## k= 1, subj=10889, session=video, type=lin, lcRSE = 143.271, modelRSE = 167.020 
## k= 2, subj=10889, session= ling, type=lin, lcRSE =  58.970, modelRSE =  63.675 
## k= 3, subj=19616, session=video, type=lin, lcRSE = 170.608, modelRSE = 182.513 
## k= 4, subj=19616, session= ling, type=lin, lcRSE =  78.907, modelRSE =  82.413 
## k= 5, subj=30474, session=video, type=exp, lcRSE = 131.403, modelRSE = 155.921 
## k= 6, subj=30474, session= ling, type=exp, lcRSE =  86.592, modelRSE =  86.705 
## k= 7, subj=30723, session=video, type=exp, lcRSE = 151.232, modelRSE = 161.251 
## k= 8, subj=30723, session= ling, type=lin, lcRSE =  82.618, modelRSE =  98.450 
## k= 9, subj=46991, session=video, type=exp, lcRSE = 183.007, modelRSE = 189.349 
## k=10, subj=46991, session= ling, type=lin, lcRSE =  95.207, modelRSE = 105.815 
## k=11, subj=52502, session=video, type=exp, lcRSE = 150.765, modelRSE = 182.379 
## k=12, subj=52502, session= ling, type=lin, lcRSE = 174.309, modelRSE = 194.418 
## k=13, subj=53927, session=video, type=exp, lcRSE = 112.004, modelRSE = 116.095 
## k=14, subj=53927, session= ling, type=lin, lcRSE =  53.342, modelRSE =  49.396 
## k=15, subj=55464, session=video, type=exp, lcRSE = 159.731, modelRSE = 178.119 
## k=16, subj=55464, session= ling, type=lin, lcRSE = 180.733, modelRSE = 187.157 
## k=17, subj=68030, session=video, type=lin, lcRSE = 158.757, modelRSE = 157.273 
## k=18, subj=68030, session= ling, type=exp, lcRSE =  75.488, modelRSE =  77.667 
## k=19, subj=69393, session=video, type=exp, lcRSE = 192.614, modelRSE = 192.866 
## k=20, subj=69393, session= ling, type=lin, lcRSE = 204.555, modelRSE = 204.689 
## k=21, subj=71440, session=video, type=exp, lcRSE = 143.482, modelRSE = 152.119 
## k=22, subj=71440, session= ling, type=lin, lcRSE =  94.287, modelRSE =  98.108 
## k=23, subj=79802, session=video, type=lin, lcRSE = 219.227, modelRSE = 257.525 
## k=24, subj=79802, session= ling, type=lin, lcRSE = 203.235, modelRSE = 232.498 
## k=25, subj=80116, session=video, type=exp, lcRSE = 167.282, modelRSE = 178.062 
## k=26, subj=80116, session= ling, type=lin, lcRSE = 160.596, modelRSE = 215.719 
## k=27, subj=87949, session=video, type=exp, lcRSE = 141.772, modelRSE = 152.277 
## k=28, subj=87949, session= ling, type=exp, lcRSE =  95.824, modelRSE = 109.201 
## k=29, subj=98309, session=video, type=lin, lcRSE = 245.119, modelRSE = 282.750 
## k=30, subj=98309, session= ling, type=lin, lcRSE = 108.740, modelRSE = 124.441
## Sanity checks:
stopifnot(all(!is.na(SR1_data$lcType)))
stopifnot(all(!is.na(SR1_data$lcRSE)))
stopifnot(all(!is.na(SR1_data$RT_lcurve)))
stopifnot(all(!is.na(SR1_data$residRT)))

## Convert SR1_data$lcType to factor:
SR1_data$lcType <- factor(SR1_data$lcType)

## Set undefined exp_tau values to 9999.9. In all cases when exp_tau is not
#  defined, we used the linear model by default. Theoretically, the linear
#  model is a special case of the exponential one when the time constant
#  tau gets very large.  9999.9 is more than an order of magnitude larger
#  than any of the valid exp_tau estimates:
idx <- is.na(SR1_data$exp_tau)
SR1_data$exp_tau[idx] <- 9999.9
stopifnot(all(!is.na(SR1_data$exp_tau)))

SR1_data$delta_AIC[idx] <- -9999.9   # negative delta_AIC values favor the linear model
stopifnot(all(!is.na(SR1_data$delta_AIC)))

## Descriptive statistics:
summary(SR1_data$residRT)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -590.5877  -60.3483  -21.6858    0.0699   18.9305 1635.7277
## Remove temporary variables from the workspace:
rm("k", "curr_subjNum", "curr_session", "curr_lcType", "curr_model_RSE",
   "idx", "D", "linM", "indivF")

Plot the Individual Learning Curves Again

Re-create the individual scatter plots from an earlier section, but this time superimpose the corresponding learning-curve fits:

## Re-create SR1_indivRT_Mblk, this time adding a column for RT_lcurve:
#  (See the R cell named "scatter_by_sbj1" around line 430 above.)
SR1_indivRT_Mblk <- NULL  # remove the old data frame to prevent interference
SR1_indivRT_Mblk <- ddply(SR1_data, .(subjNum, session, trialMblk), function(D1) {
  data.frame(
    RT_mean   = mean(D1$clippedRT),
    RT_lcurve = mean(D1$RT_lcurve),   # <-- new variable
    N_obs = nrow(D1)                  # number of observations
)})

## Sanity check:
stopifnot(constant(SR1_indivRT_Mblk$N_obs) == PAR$miniblock_size1)

## Alternative time variable -- useful for plotting sessions side-by-side:
SR1_indivRT_Mblk$trialMblk2 <- (with (SR1_indivRT_Mblk,
  convert_trialNum_to_trialNum2(trialMblk, session)))

## Scatter-plot the miniblock-averaged RTs separately for each individual:
gp$RT_indiv_scatter2 <- ggplot(SR1_indivRT_Mblk, 
  aes(x = trialMblk2, y = RT_mean, shape=session)) + 
  geom_point(color="#AAAAAA") + facet_wrap(vars(subjNum)) +
  geom_line(aes(y = RT_lcurve, color=session), linewidth=1)
print(gp$RT_indiv_scatter2)

## Plot the residuals:
SR1_indivRT_Mblk$residRT <- SR1_indivRT_Mblk$RT_mean - SR1_indivRT_Mblk$RT_lcurve
gp$RT_indiv_residuals2 <- ggplot(SR1_indivRT_Mblk,
  aes(x = trialMblk2, y = residRT, shape=session, color=session)) + 
  geom_point() + facet_wrap(vars(subjNum))
print(gp$RT_indiv_residuals2)

These plots demonstrate that our learning-curve fits do pass through the middle of their respective clouds of data points.

Note that in 4 of the 30 cases the learning curves go slightly upward. I’m not sure whether this is a cause for concern or not. Apparently, in addition to the learning effects there are also fatigue effects, particularly on Day 2.

Let’s replot the group-averaged figure too. It is a candidate for a figure in the CogSci paper.

## Re-create SR1_groupRT_data, this time adding a column for RT_lcurve:
#  (See the R cell named "group_RT_data1" around line 130 above.)
SR1_groupRT_data <- NULL  # remove the old data frame to prevent interference
SR1_groupRT_data <- ddply(SR1_data, .(session, trialNum), function(D1) {
  data.frame(
    RT_mean  = mean(D1$clippedRT),
    RT_lcurve = mean(D1$RT_lcurve),   # <-- new variable
    N_sbj = nrow(D1)
)})
stopifnot(is.constant(SR1_groupRT_data$N_sbj))

## Define new time variables:   (This just repeats what we did earlier.)
SR1_groupRT_data$trialNum2 <- (with (SR1_groupRT_data,
  convert_trialNum_to_trialNum2(trialNum, session)))

SR1_groupRT_data$trialMblk <- convert_trialNum_to_trialMblk(SR1_groupRT_data$trialNum)

SR1_groupRT_data$trialMblk2 <- (with (SR1_groupRT_data,
  convert_trialNum_to_trialNum2(trialMblk, session)))
## Re-create SR1_groupRT_Mblk, this time adding a column for RT_lcurve:
#  (See the R cell named "groupRT_Mblk1" around line 180 above.)
SR1_groupRT_Mblk <- NULL  # remove the old data frame to prevent interference
SR1_groupRT_Mblk <- ddply(SR1_groupRT_data, .(session, trialMblk), function(D1) {
  data.frame(
    trialMblk2 = constant(D1$trialMblk2),
    RT_mean  = mean(D1$RT_mean),
    RT_lcurve = mean(D1$RT_lcurve),   # <-- new variable
    N_obs = nrow(D1) * constant(D1$N_sbj, verifyp=FALSE)   # number of observations
)})

## Sanity check: The number of observations in each miniblock should be 
#  120 = 15 subjects * 8 trials/miniblock = const
stopifnot(constant(SR1_groupRT_Mblk$N_obs) == PAR$N_sbj * PAR$miniblock_size)

## Scatter-plot the group-averaged RTs:
gp$RT_mean_scatter4 <- ggplot(SR1_groupRT_Mblk, 
  aes(x = trialMblk2, y = RT_mean, shape=session)) + 
  geom_point(color="#AAAAAA") +
  geom_line(aes(y = RT_lcurve, color=session), linewidth=1)
print(gp$RT_mean_scatter4)

## Plot the residuals:
SR1_groupRT_Mblk$residRT <- SR1_groupRT_Mblk$RT_mean - SR1_groupRT_Mblk$RT_lcurve
gp$RT_group_residuals2 <- ggplot(SR1_groupRT_Mblk,
  aes(x = trialMblk2, y = residRT, shape=session, color=session)) + 
  geom_point()
print(gp$RT_group_residuals2)

Re-Calculate the Second Trial-Level Exclusion Criterion

"./SR1_get_data2.Rmd" defined two sets of trial-level exclusion criteria. They are flagged in SR1_data$filter1 and SR1_data$filter2.

The first set of criteria (filter1) is uncontroversial. It is applied in the next cell. The second set, however, involves z-transformed scores based on the raw RT data. It is useful to re-calculate these z-scores now that we are working with residual RTs instead of raw RTs.

The first filter is applied here:

## Apply the FIRST (but not the SECOND) set of trial-level exclusion criteria:
D <- subset(SR1_data, filter1)  # Note: filter1 rather than filter2

## Sanity checks:
stopifnot(D$timeout==0)     # trials with RT>2000 have been excluded
stopifnot(D$acc==1)         # trials with incorrect responses have been excluded
stopifnot(D$filter1)

## Remove columns that have constant values:
D <- D[, setdiff(names(D), c("timeout", "acc", "filter1"))]

## Print a message on the console:
cat(sprintf("Extracted a dataframe D with %d rows and %d columns.\n",
            nrow(D), ncol(D)))
## Extracted a dataframe D with 22373 rows and 37 columns.

The second filter (filter2) is based exclusively on z-scores of the raw RT data. Trials with abs(zRT) > meta_data$zRT_threshold were excluded from the sample. “./SR1_get_data2.Rmd” used a threshold of 3.0, which eliminated 395 trials, leaving 21978 “good” ones.

Let’s apply the same general idea to the residual RT data. To that end, we replicate the processing from lines 420 to ~500 in “./SR1_get_data2.Rmd”:

## Calculate rRT_stats and store them in a dataframe:
rRT_stats <- ddply(D, .(subjNum, session), function(D1) {
  data.frame(
    subjNum = constant(D1$subjNum, verifyp=FALSE), 
    session = constant(D1$session, verifyp=FALSE), 
    rRT_mean = mean(D1$residRT),   # rather than mean(D1$RT)
    rRT_std  = sd(D1$residRT)      # rather than sd(D1$RT)
)})

print(rRT_stats, digits = 3)   # in milliseconds from static-image onset
##    subjNum session rRT_mean rRT_std
## 1    10889   video  -1.3050   130.6
## 2    10889    ling   2.9122    58.3
## 3    19616   video  -9.3284   136.4
## 4    19616    ling   0.3976    79.0
## 5    30474   video  -3.3599   115.2
## 6    30474    ling   0.0552    86.6
## 7    30723   video  -7.3819   102.5
## 8    30723    ling  -1.4304    58.2
## 9    46991   video  -9.8203   127.6
## 10   46991    ling   2.5577    93.6
## 11   52502   video   0.7535   132.2
## 12   52502    ling -13.5553   111.5
## 13   53927   video   1.8614    95.6
## 14   53927    ling   1.1825    53.4
## 15   55464   video  -4.1095   144.5
## 16   55464    ling -15.5875   104.7
## 17   68030   video  -3.0021   147.7
## 18   68030    ling  -0.3650    74.9
## 19   69393   video  -0.2688   147.8
## 20   69393    ling -10.0805   149.3
## 21   71440   video   2.7865   106.9
## 22   71440    ling  -1.3914    73.5
## 23   79802   video  -3.9431   172.3
## 24   79802    ling  -6.9834   172.8
## 25   80116   video  -5.2667   140.9
## 26   80116    ling  -1.0383   154.2
## 27   87949   video  -4.1867   123.6
## 28   87949    ling  -0.0156    95.9
## 29   98309   video -14.1834   202.0
## 30   98309    ling  -1.9475    91.4

The means are relatively small because we are averaging residuals around the main trend as captured by the individual learning curves.

Also, most means are negative. This is because filter1 eliminated disproportionately more trials with long RTs than trials with short ones.

We need this information at the trial level. In the language of relational databases, we need to perform a JOIN operation using the pair (subjNum, session) as the key linking the two tables. Conveniently, the plyr package implements this functionality. (We could also use merge() from the base package, but it tends to mess up the order of the rows in the resulting dataframe.)

SR1_data <- join(SR1_data, rRT_stats, by = c("subjNum", "session"))
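
The join is many-to-one: the 30 rows of rRT_stats are matched onto the trial-level rows of SR1_data. A quick check (a sketch, assuming the 23040-row dataframe documented in the preamble):

## Sanity check: the join must neither add nor drop trials:
stopifnot(nrow(SR1_data) == 2 * PAR$N_sbj * PAR$N_trials_per_session)   # 23040
stopifnot(all(!is.na(SR1_data$rRT_mean)))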

We calculate the studentized residual RT and store it as a new variable "zrRT" (z-transformed residual RT).

We then define our third filter for excluding trials -- we keep only the trials whose abs(zrRT) does not exceed meta_data$zRT_threshold.

The final filter is stored as variable “filter3” in the dataframe. It subsumes filter1 but not filter2.

## Add a new column to the dataframe: zrRT = studentize(residRT)
SR1_data$zrRT <- with(SR1_data, ((residRT - rRT_mean) / rRT_std))

## Add one final column: filter3
SR1_data$filter3 <- with(SR1_data, (filter1 & (abs(zrRT) <= meta_data$zRT_threshold)))
## Tabulate how many trials satisfy the three filters.
#  filter2 subsumes filter1:
with(SR1_data, table(filter1, filter2, useNA="ifany"))
##        filter2
## filter1 FALSE  TRUE
##   FALSE   667     0
##   TRUE    395 21978
#  filter3 also subsumes filter1:
with(SR1_data, table(filter1, filter3, useNA="ifany"))
##        filter3
## filter1 FALSE  TRUE
##   FALSE   667     0
##   TRUE    400 21973
#  Importantly, filter2 and filter3 are strongly but not perfectly correlated:
with(SR1_data, table(filter2, filter3, useNA="ifany"))
##        filter3
## filter2 FALSE  TRUE
##   FALSE  1027    35
##   TRUE     40 21938
## Sanity check: The filtered zrRT distribution should lie entirely in [-3.0, +3.0]
summary(subset(SR1_data, filter3)$zrRT)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.87300 -0.49651 -0.18535 -0.09161  0.17990  2.99125

Very interestingly, there are 40 cases that are passed by filter2 but blocked by filter3. These are points that are too far away from the learning curves, while simultaneously “hiding” in the bulk of the distribution.

Conversely, there are 35 cases that are passed by filter3 but blocked by filter2. These are points -- typically early in the video session -- with large raw RTs that are accounted for by the upward bend of the learning curves.

We strongly recommend deprecating filter2 and using filter3 instead. Thus, as of May 2024, the recommended method for retrieving the "good" data is the following code:

## Recommended method for accessing the "good" trials for substantive analyses:
D <- subset(SR1_data, filter3)

Save as RData

Save the variables "SR1_data", "SR1_sbj", "meta_data", and "model_fits" as an RData file. Note that this supersedes the similar RData file produced by "./SR1_get_data2.Rmd".

## Remove the variable "trialMblk" from the main data frame. It is easy to
#  recompute whenever it's needed, with the added flexibility of choosing
#  the miniblock size.
SR1_data$trialMblk <- NULL

## Copy PAR$ref_trials to meta_data. It is an arbitrary choice related to the
#  exponential-curve fits. (This choice shouldn't matter, but just in case.)
meta_data$exp_ref_trials <- PAR$ref_trials

## Pathnames
meta_data$main_data2.RData <- "SR1_main_data2.RData"
full_filename <- file.path(meta_data$SeeRel1_root, meta_data$data_path,
                           meta_data$main_data2.RData)

## Save the data to disk. Print a message on the console:
save(list = c("SR1_data", "SR1_sbj", "meta_data", "model_fits"), 
     file = full_filename)
cat(sprintf("Saved a dataframe with %d rows and %d columns in %s.\n",
            nrow(SR1_data), ncol(SR1_data), full_filename))
## Saved a dataframe with 23040 rows and 43 columns in /Users/petrov.11/work/RelExper/SeeRel1/data/SR1_main_data2.RData.
## Tidy up the workspace
rm("full_filename")

Where to go from here: First, check out "./SR1_exploratory2.Rmd". The publication-quality analyses and figures will be produced by "./SR1_CogSci2024_final.Rmd".

End of script “SR1_learning_effect1.Rmd”.