A tutorial on generalized additive mixed-effects model (GAMM)
Preliminaries: The challenge of dynamic data
Phonetics used to concern mostly with static data, such as VOT, formant values at vowel midpoint, or F0 at vowel onset. But as computers are getting more and more powerful, it is common (almost a necessity) to analyze dynamic data. For example, instead of F0 at vowel onset, pitch contour during the voiced parts needs to be analyzed, and similarly, formant trajectories are analyzed instead of formant values at vowel midpoints.
There are in general two main approaches to analyze dynamic data. The first approach is to decompose dynamic data into simpler parameters. For example, Discrete Cosine Transform (DCT) can decompose functions into multiple coefficients, among which $k_0$ is related to mean/intercept, $k_1$ to slope, and $k_2$ to (parabolic) curve. Another approach is to model the dynamics itself either using polynomials (linear models) or the addition of multiple base functions (additive models). Linear regression uses polynomials to model dynamic data, while generalized additive models employs the addition of multiple base functions.
Different kinds of dynamic data
Dynamic data can be divided into different types, according to its complexity. The following lists several common functions.
$$\text{Linear: }y = ax + b$$ $$\text{Quadratic: }y = ax^2+bx+c$$ $$\text{Cubic: }y = ax^3+bx^2+cx+d$$ $$\text{Quartic: }y = ax^4+bx^3+cx^2+dx+e$$ $$\text{Exponential: }y = e^x$$ $$\text{Sine: }y = sin(x)$$
Linear functions are the easiest to analyze, because only an intercept and a slope are needed to capture a linear function. Linear mixed-effects models can be used to analyze linear functions.
Quadratic, cubic an quartic functions are more challenging, but an extension of linear mixed-effects models, Growth Curve Analysis (GCA, Mirman 2014) can be used to analyze such functions.
Exponential and sine functions, however, cannot be handled by GCA, because it cannot be decomposed into a linear component and non-linear components. However, some other statistical methods are able to model such non-linear data, including Generalized Additive Models (GAM) and Functional Data Analysis (FDA). In essence, GAM and FDA are the same in that they use the addition of basis functions to model complex curves.
Workflow of GAMM
Fitting GAMM in R is achieved using the mgcv
package and data visualization is usually achieved using the itsadug
1 package. GAMM can be fitted using the gam
function, but more often it is fitted using bam
for very large datasets.
library(mgcv)
library(itsadug)
The basic workflow of GAMM is as follows:
- Create an ordered factor for the factors of interest
- Fit the model with
gam()
orbam()
- Handling autocorrelation errors
- Model criticism
- Visualization of results
- Significance testing
The reason to change the factor into an ordered factor is that the constant difference (the parametric coefficients) and the non-linear difference can be distinguished (Wieling 2018:94). To do that, using the following code. Also, the factor needs to be changed into treatment coding.
data$factor <- as.ordered(data$factor)
contrasts(data$factor) <- "contr.treatment"
Another thing to note is that if several factors of interest are involved, it is not recommended to fit separate models for separate factors. A more informed solution is to construct a new variable based on the factors of interest (Wieling 2018:106). For example, if there are two factors of interest, tone
and generation
, the two factors can be combined to create a new factor toneGeneration
using the following code. The function interaction()
enables the order in the original factors to be preserved in the combined factor.
data$factor <- interaction(factor1, factor2)
GAMMs can be fitted with formulae structures similar to those used in lme4
. For parametric coefficients, simply include the factor, but for smooth factors, you need to surround the factors with s()
(means the smooths) and some additional parameters need to be specified. Within each smooth, there are several things you can change. The parameter bs=
specifies the types of basis spline functions, and it can take values such as tp
(the default thin plate regression spline), cr
(cubic regression spline). Another parameter k
can be specifed and it stands for how many basis functions are needed. The default value of k
is 10 and for pitch values within one syllable, there are usually only ten values, and k
can be set to $9=10-1$.
model <- bam(f0 ~ tone + # the parametric coefficient
s(time) + # the smooth term
s(time, by = factor) + # the smooth term by tone
s(time, speaker, bs = "fs", m = 1), # random smooth
data = data,
rho = rho1, # specify the autocorrelation error factor
AR.start = data$start.event,
discrete = TRUE) # use this to speed up model fitting
GAMM, as the second M stands for mixed-effects, is also able to model random effects 2. Three types of random effects can be included, random intercept, random slope and random smooth. The formulae below shows the formalization in mgcv
. As in fixed effects, bs
stands for basis spline functions. re
stands for random effects, and fs
stands for factor smooth, in which a separate smooth is fitted for each level in the random factor. The nonlinear penalty is specified using the parameter m
, and when it equals 1, it means that the first derivative is penalized.
s(speaker, bs = "re") # random intercept by speaker
s(speaker, factor, bs = "re") # random slope of tone by speaker
s(time, speaker, bs = "fs", m = 1) # random smooth of tone by speaker
Of course, it is possible to fit smooth on two dimensions, and this is achieved by te()
(a full tensor product) or ti()
(a tensor product interaction) in the formulae. The third case study in Chuang et al. (2021) is a friendly hands-on tutorial on this kind of analysis.
Handling autocorrelation
One kind of error in time series data which is usually ignored is the autocorrelated error (Baayen et al. 2018). It refers to the correlation between the measuring points. Autocorrelated error can be visualized as a function of lag, i.e., the correlation between the present point with the previous n points. When lag is 0, it refers to the correlation of the point with itself and therefore is 1. When lag is 1, it refers to the correlation of the point with the immediately preceding point, and it is also called AR(1)
, the autocorrelated residual at lag 1. This is the value usually used in fitting GAMMs with autocorrelated errors taken into consideration, and it is called the AR1 error model.
Before specifying AR(1)
, the data should be revised to specify the starting point of each time series, using the start_event()
function in itsadug
.
data <- start_event(data, event= "factor")
After that, GAMM can be fitted on the data first without autocorrelated errors. This model can be used to get the AR(1)
value and this value can be used to fit an AR1 error model with the same model structure as the previous one. AR1 error can be get using two functions, as shown in the following code. The value can then be fed into the gam()
or bam()
function by specifying the rho
parameter.
r1 <- start_value_rho(model)
r1 <- acf_resid(model)[2]
Model criticism
Like linear models, GAMMs also need to check their residual values. The gam.check()
function provides important dignosis information on the model. It generates four plots. The first plot is a normal quantile plot. This plot can be used to diagnose whether the model residuals follow a normal distribution. If the points all fall on the red line, it shows that the residuals follow a normal distribution. Sometimes, the residuals can have two heavy tails, suggesting a t-distribution. In this scenario, the data should be fitted instead with a scaled-t distribution, by specifying family="scat"
(Chuang et al. 2021:18). The histogram of residuals can also be used to help diagnose residual distribution.
The two scatter plots can be used to assess heteroscedasticity, which refers to the unequal variance depending on the values of the predictors or the fitted values. If the variability is correlated with the predictors or fitted values, the homoscedasticity assumption is said to be violated.
Visualization with GAMM
The most popular way to visualize the data using GAMM is to plot the fit and the difference curve of the data. For the interaction of two numerical variable, heat map is also usually used. Visualization with GAMM is usually achieved using the itsadug
package. It is also possible to extract the data and visualize it in ggplot2
, which enables finer control over aesthetics and layout 3.
GAMM results are usually visualized with smooth plots and difference curves. Smooth curves plot the model predicted smooths, while difference curves plot the difference between the curve of one level and the one of the reference level within a specific factor.
The plot_smooth()
function in itsadug
can be used to plot smooth curves, and the basic usage is exemplified as below. To plot the smooth curves, a GAMM object is needed and then you need to choose on which factor the smooth curves are on. The plot_all
parameter specifies the factor you want to plot. Model summary can be printed with print.summary
equals TRUE
, and otherwise with FALSE
. The random effects can be included by setting rm.ranef
to FALSE
, and removed by setting it to TRUE
.
plot_smooth(model, view = "time",
plot_all = "factor",
print.summary = FALSE,
rm.ranef = FALSE)
Apart from smooth curves, difference curves can also be plotted using the plot_diff()
function. As plot_smooth()
, a model and a smooth factor need to be specified, and it also requires a comparison factor. The comp
parameter specifies which factor levels you want to coompare. The resultant figure is a curve with a horizontal zero line. If the shaded area (the 95% confidence interval) contains the zero line, then the difference can be said to be not significant, and otherwise, it is said to be significant.
plot_diff(model, view = "time",
comp = list(factor = c("level1", "level2")),
print.summary = FALSE)
These functions can also be used to get the data and to plot it in ggplot2
(of course, get model prediction is a better approach). The model fit data can be accessed using the following codes. Another alternative is to use functions in tidymv
and a more recent tidygam
package.
# get model fit data
fitData <- plot_smooth(model, view = "time",
plot_all = "factor",
print.summary = FALSE,
rug = FALSE, rm.ranef = TRUE)$fv
Significance testing with GAMM
Factors can influence various aspects of dynamic data. Dynamic data can differ in intercept only, slope only, and also in the overall shape. Also, it is possible that curves can be different in some part but not in others. Although visualization provides a more straightforward way to interpret the results, significance testing is a necessity for confirmatory studies in particular. Significance testing with GAMM can be achieved by summary statistics, model comparison and visual methods.
Sóskuthy (2021) evaluated the random effect structures and significance testing in GAMM. Like all other models, GAMM results can be summarized with summary statistics. The summary of GAMM can be divided into two parts, the parametric coefficients showing the difference in intercept (overall height of the curve) and the smooth terms showing the difference in smooth. t
-tests are used to get p-values for parametric coefficients, while F-tests are used for smooth terms. If there is a significant effect in parametric coefficients, it indicates that the curves are different in overall height due to predictors. On the other hand, if there is a significant effect in smooth terms, it is said that the curves have difference shapes due to predictors.
Apart from testing significance using model summaries, it is also possible to achieve it using model comparison. This can be done using the compareML()
function in itsadug
, which is the same in essence with the anova()
function used to compare linear mixed-effects models.
Further reading
Baayen, R. H., van Rij, J., de Cat, C., & Wood, S. (2018). Autocorrelated Errors in Experimental Data in the Language Sciences: Some Solutions Offered by Generalized Additive Mixed Models. In D. Speelman, K. Heylen, & D. Geeraerts (Eds.), Mixed-Effects Regression Models in Linguistics (pp. 49–69). Springer International Publishing. https://doi.org/10.1007/978-3-319-69830-4_4
Introduces the problem of autocorrelated error in dynamic data analysis and provides a solution by including an autoregressive AR(1)
process for the errors. The solution is to first fit a model without autocorrelation and get the autoregressive proportionality. It is specified using the rho
parameter and another model is fitted with this parameter.
Chuang, Y.-Y., Fon, J., Papakyritsis, I., & Baayen, H. (2021). Analyzing Phonetic Data with Generalized Additive Mixed Models. In M. J. Ball (Ed.), Manual of Clinical Phonetics (1st ed., pp. 108–138). Routledge. https://doi.org/10.4324/9780429320903-10
Provides three case studies, which demonstrates the full capacity of GAMM in analyzing linguistic data.
Sóskuthy, M. (2021). Evaluating generalised additive mixed modelling strategies for dynamic speech analysis. Journal of Phonetics, 84, 101017. https://doi.org/10.1016/j.wocn.2020.101017
A critical paper on random effect structure and significance testing using GAMM.
Wieling, M. (2018). Analyzing dynamic phonetic data using generalized additive mixed modeling: A tutorial focusing on articulatory differences between L1 and L2 speakers of English. Journal of Phonetics, 70, 86–116. https://doi.org/10.1016/j.wocn.2018.03.002
A step-by-step hands-on tutorial of GAMM using phonetic data.
Wood, S. N. (2017). Generalized additive models: An introduction with R (Second edition). CRC Press Taylor & Francis Group.
A comprehensive book on Generalized additive models, with many mathematical and implementation details.
van Rij, J. (2015). Overview GAMM analysis of time series data. https://jacolienvanrij.com/Tutorials/GAMM.html
An online tutorial by van Rij.
Sóskuthy, M. (2017). Generalised additive mixed models for dynamic analysis in linguistics: A practical introduction (arXiv:1703.05339). arXiv. https://doi.org/10.48550/arXiv.1703.05339
A preprint with more detailed guide on GAMM.
-
van Rij, J., Wieling, M., Baayen, R. H., & van Rijn, H. (2020). itsadug: Interpreting time series and autocorrelated data using GAMMs. ↩︎
-
See the following paper for random effects in linear mixed-effects models: Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001 ↩︎
-
See also the following link on using
tidymv
to plot GAMM results: https://cran.r-project.org/web/packages/tidymv/vignettes/plot-smooths.html. ↩︎