Incorporating interactions into structured life course modelling approaches: A simulation study and applied example of the role of access to green space and socioeconomic position on cardiometabolic health

Background: Structured life course modelling approaches (SLCMA) have been developed to understand how exposures across the lifespan relate to later health, but have primarily been restricted to single exposures. As multiple exposures can jointly impact health, here we: i) demonstrate how to extend SLCMA to include exposure interactions; ii) conduct a simulation study investigating the performance of these methods; and iii) apply these methods to explore associations of access to green space, and its interaction with socioeconomic position, with child cardiometabolic health. Methods: We used three methods, all based on lasso regression, to select the most plausible life course model: visual inspection, information criteria and cross-validation. The simulation study assessed the ability of these approaches to detect the correct interaction term, while varying parameters which may impact power (e.g., interaction magnitude, sample size, exposure collinearity). Methods were then applied to data from a UK birth cohort. Results: There were trade-offs between false negatives and false positives in detecting the true interaction term for different model selection methods. Larger sample size, lower exposure collinearity, centering exposures, continuous outcomes and a larger interaction effect all increased power. In our applied example we found little-to-no association between access to green space, or its interaction with socioeconomic position, and child cardiometabolic outcomes. Conclusions: Incorporating interactions between multiple exposures is an important extension to SLCMA. The choice of method depends on the researchers’ assessment of the risks of under- vs over-fitting. These results also provide guidance for improving power to detect interactions using these methods.

For the binary 'overweight' outcome, we simply recoded the BMI variable into overweight or not, based on a BMI > 25.
We highlight again that this simulated data is purely to illustrate the logic and application of these structured life-course methods, and should not be taken as a reflection of the real-world patterns or effect sizes of these variables.

Section S2: Details of data for formal simulation study
This section details the datasets generated for the formal simulation study in the second section of the paper, which explores how well these structured life course methods, when including interaction terms, detect the true interaction term under varying parameters. The causal relations between the exposures are the same as described in the 'Worked Example' section of the main text. The simulated BMI outcome depended on the life course hypothesis being assessed, i.e., an interaction between SEP and either critical period at time 3 (as in the worked example), accumulation, or change from time 2 to 3. The factors we varied were:
• Sample sizes: comparing n = 1,000 vs n = 10,000
For all simulations, we used a binary socioeconomic position (SEP) covariate, as described in section S1 of this supplementary material.

Exposure variables (access to green space)
Binary exposure variables with low collinearity (defined as an unadjusted correlation of approximately 0.3 between adjacent exposures) were generated as described in section S1 of this supplementary material.
Continuous exposure variables with low collinearity (defined as an unadjusted correlation of approximately 0.5 between adjacent exposures) were generated as follows:
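As an illustration only (this is not the code used for the simulations), the following minimal Python sketch shows one way to generate exposures at three time points with a target correlation between adjacent exposures. It assumes a prevalence of 0.5 for the binary exposures and standard-normal continuous exposures, and it ignores the effect of SEP on green space; the function names `binary_chain` and `continuous_chain` are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def binary_chain(n, rho, rng, n_times=3):
    """Binary exposures at n_times points; adjacent correlation ~ rho.

    Assumption: prevalence of 0.5 at every time point, so that
    P(next = 1 | previous) = 0.5 +/- rho / 2 yields a correlation of rho.
    """
    series = [[int(rng.random() < 0.5) for _ in range(n)]]
    for _ in range(n_times - 1):
        prev = series[-1]
        series.append([int(rng.random() < 0.5 + rho * (p - 0.5)) for p in prev])
    return series


def continuous_chain(n, rho, rng, n_times=3):
    """Standard-normal exposures at n_times points; adjacent correlation ~ rho."""
    series = [[rng.gauss(0, 1) for _ in range(n)]]
    for _ in range(n_times - 1):
        prev = series[-1]
        noise_sd = (1 - rho ** 2) ** 0.5  # keeps the variance at 1
        series.append([rho * p + noise_sd * rng.gauss(0, 1) for p in prev])
    return series


rng = random.Random(2022)
green_bin = binary_chain(20_000, rho=0.3, rng=rng)
green_cont = continuous_chain(20_000, rho=0.5, rng=rng)
print(round(pearson(green_bin[0], green_bin[1]), 2))    # close to 0.3
print(round(pearson(green_cont[1], green_cont[2]), 2))  # close to 0.5
```

With a prevalence of 0.5 the correlation between adjacent binary exposures equals the difference in conditional probabilities (0.65 vs 0.35 for a target of 0.3), which is why no further tuning is needed in this simplified setting.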

Outcome variables (BMI/overweight)
Continuous outcome variables with binary exposures, where critical period at time 3 and its interaction with SEP caused BMI, were generated as described in section S1 of this supplementary material. To assess different strengths of the interaction, we varied the gamma parameter (γ), at values of 0 (no interaction), 0.5 (very small interaction), 1 (small interaction), 2 (moderate interaction), 3 (large interaction) and 4 (very large interaction).
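To make the role of the gamma parameter concrete, the sketch below simulates BMI from a binary SEP variable, a binary critical period exposure at time 3 and their interaction, and then recovers the interaction from the four group means. It is a simplified illustration rather than the simulation code itself: the intercept, the SEP effect, the residual standard deviation, the prevalences, the sign of every coefficient and the independence of SEP and green space are all assumptions; only the green space effect of 2 BMI units and the gamma grid come from the text.

```python
import random


def simulate_bmi(n, gamma, rng):
    """Simulate (high_sep, crit3, bmi) rows with an SEP x green space interaction."""
    rows = []
    for _ in range(n):
        high_sep = int(rng.random() < 0.5)  # hypothetical prevalence
        crit3 = int(rng.random() < 0.5)     # hypothetical prevalence
        int3 = high_sep * crit3             # interaction term
        bmi = (25.0                         # hypothetical intercept
               + 1.0 * high_sep             # hypothetical SEP effect
               + 2.0 * crit3                # green space effect of 2 BMI units
               + gamma * int3               # interaction strength (gamma)
               + rng.gauss(0, 3))           # hypothetical residual SD
        rows.append((high_sep, crit3, bmi))
    return rows


def cell_mean(rows, sep, green):
    """Mean BMI in one of the four SEP x green space groups."""
    values = [bmi for s, g, bmi in rows if s == sep and g == green]
    return sum(values) / len(values)


def interaction_estimate(rows):
    """Difference-in-differences of the group means.

    For a saturated 2 x 2 linear model this equals the interaction coefficient.
    """
    return ((cell_mean(rows, 1, 1) - cell_mean(rows, 1, 0))
            - (cell_mean(rows, 0, 1) - cell_mean(rows, 0, 0)))


rng = random.Random(1)
for gamma in (0, 0.5, 1, 2, 3, 4):
    estimate = interaction_estimate(simulate_bmi(20_000, gamma, rng))
    print(f"gamma = {gamma}: estimated interaction = {estimate:.2f}")
```

The smaller gamma is relative to the residual variation, the harder it is to distinguish the estimate from zero, which is the sense in which interaction magnitude and sample size affect power.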
Continuous outcome variables with continuous exposures, again where critical period at time 3 and its interaction with SEP caused BMI, were generated as follows (the 'green3' parameter was chosen to be 0.02 BMI per unit increase in green space, as the continuous 'green3' standard deviation is ~50, and two times this should cover most of the variation in green space distance [0.02 * 50 * 2 = 2], making results broadly comparable to the binary green space effect of 2 BMI units (Gelman 2008)). To assess different strengths of the interaction, we varied the gamma parameter (γ), at values of 0 (no interaction), 0.005 (very small interaction), 0.01 (small interaction), 0.02 (moderate interaction), 0.03 (large interaction) and 0.04 (very large interaction).
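The scaling argument above can be checked with a few lines of arithmetic. The sketch below takes the standard deviation of ~50 and the binary effect of 2 BMI units from the text; the observation that the continuous gamma grid equals the binary grid divided by the same factor (2 × 50 = 100) is our reading of the values listed, and the variable names are hypothetical.

```python
import math

GREEN3_SD = 50     # approximate SD of continuous 'green3' (from the text)
BINARY_EFFECT = 2  # BMI units for the binary green space effect (from the text)

# Per-unit effect such that a change of two SDs matches the binary effect
scale = 2 * GREEN3_SD
per_unit = BINARY_EFFECT / scale

# Binary-exposure gamma grid, rescaled to the continuous exposure
binary_gammas = [0, 0.5, 1, 2, 3, 4]
continuous_gammas = [g / scale for g in binary_gammas]

print(per_unit)
print(continuous_gammas)
print(math.isclose(per_unit * GREEN3_SD * 2, BINARY_EFFECT))
```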
Continuous outcome variables with binary exposures, where accumulation and its interaction with SEP caused BMI, were generated as follows (with accumulation encoded as in To assess different strengths of the interaction, we varied the gamma parameter (γ), at values of 0 (no interaction), 0.005 (very small interaction), 0.01 (small interaction), 0.02 (moderate interaction), 0.03 (large interaction) and 0.04 (very large interaction).
Continuous outcome variables with binary exposures, where change from time 2 to 3 (using an increase in access to green space between these time points) and its interaction with SEP caused BMI, were generated as follows (with an increase in green space between times 2 and 3 encoded as in To assess different strengths of the interaction, we varied the gamma parameter (γ), at values of 0 (no interaction), 0.5 (very small interaction), 1 (small interaction), 2 (moderate interaction), 3 (large interaction) and 4 (very large interaction).
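For readers unfamiliar with how life course hypotheses are turned into variables, the sketch below shows one plausible encoding for binary exposures. These encodings are assumptions made for illustration and may differ from those used in the paper: accumulation is taken as the number of time points with access to green space, and an increase between times 2 and 3 as no access at time 2 followed by access at time 3. The names 'crit3' and 'int3' follow those used elsewhere in this supplement; 'accumulation', 'accumulation_int', 'green_inc23' and 'green_inc23_int' are hypothetical.

```python
def encode_hypotheses(green1, green2, green3, high_sep):
    """Encode life course hypotheses for one person with binary exposures.

    Illustrative encodings only; not necessarily those used in the paper.
    """
    main = {
        "crit3": green3,                           # critical period at time 3
        "accumulation": green1 + green2 + green3,  # time points with access (0-3)
        "green_inc23": (1 - green2) * green3,      # gained access from time 2 to 3
    }
    # Each interaction term is the encoded variable multiplied by SEP
    interactions = {
        "int3": main["crit3"] * high_sep,
        "accumulation_int": main["accumulation"] * high_sep,
        "green_inc23_int": main["green_inc23"] * high_sep,
    }
    return {**main, **interactions}


# A high-SEP person with access at times 1 and 3 but not at time 2
person = encode_hypotheses(green1=1, green2=0, green3=1, high_sep=1)
print(person)
```

For a person from a low SEP background (`high_sep=0`) all three interaction terms are zero, so the interaction coefficients only shift the outcome among those from high SEP backgrounds.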
For binary 'overweight' outcome variables, we again simply recoded the BMI variable into overweight or not, based on a BMI > 25.
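The recoding itself is a single comparison. A minimal sketch, using hypothetical BMI values and a hypothetical function name, is given below; note that the inequality is strict, so a BMI of exactly 25 is not coded as overweight.

```python
def overweight(bmi, threshold=25.0):
    """Return 1 if BMI is strictly above the threshold, otherwise 0."""
    return int(bmi > threshold)


bmis = [22.4, 25.0, 25.1, 31.7]  # hypothetical simulated BMI values
print([overweight(b) for b in bmis])  # [0, 0, 1, 1]
```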

Study population
The Avon Longitudinal Study of Parents and Children (ALSPAC) is a prospective birth cohort study in which pregnant women resident in Avon, UK with expected dates of delivery between 1st April 1991 and 31st December 1992 were invited to take part. The initial number of pregnancies enrolled is 14,541; of these initial pregnancies there were a total of 14,676 fetuses, resulting in 14,062 live births and 13,988 children who were alive at 1 year of age (Fraser et al. 2013). These mothers, their partners, and their index children have been followed with regular questionnaire and clinic assessments since this time. When the oldest children were approximately seven years of age, an attempt was made to bolster the initial sample with eligible cases who had failed to join the study originally. The total sample size for analyses using any data collected after the age of seven is therefore 15,447 pregnancies, resulting in 14,901 children alive at 1 year of age. After removing children or their mothers who had withdrawn consent for their data to be used, the final sample in this analysis consists of 14,845 children.
Please note that the study website contains details of all the data that is available through a fully searchable data dictionary and variable search tool (http://www.bristol.ac.uk/alspac/researchers/our-data/).

Variable Selection
Outcome variables: Our outcome variables were BMI, obesity (BMI ≥ 85th centile [z-score of 1.036] from the age and sex standardised 1990 UK growth reference charts (Cole et al. 1995)) and systolic and diastolic blood pressure assessed at a follow-up research clinic when the study children were approximately 7 years of age.
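The correspondence between the 85th centile and a z-score of 1.036 follows from the standard normal distribution, as the short check below illustrates. The sketch assumes BMI has already been converted to an age and sex standardised z-score (the conversion against the growth reference is not shown), and the example z-scores and the function name are hypothetical.

```python
from statistics import NormalDist

# z-score corresponding to the 85th centile of a standard normal distribution
z_85 = NormalDist().inv_cdf(0.85)
print(round(z_85, 3))  # 1.036


def at_or_above_85th_centile(bmi_z, cutoff=z_85):
    """Return 1 if the standardised BMI is at or above the cutoff, otherwise 0."""
    return int(bmi_z >= cutoff)


example_z = [-0.5, 1.0, 1.04, 2.3]  # hypothetical BMI z-scores
print([at_or_above_85th_centile(z) for z in example_z])  # [0, 0, 1, 1]
```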
Exposure variables: ALSPAC is part of the LifeCycle project EU Child Cohort Network, an initiative to harmonise data across a network of European birth/childhood cohort studies to investigate how early-life stressors impact health across the lifespan (Jaddoe et al. 2020). This harmonisation process includes repeated measures of the built environment, such as access to green space, which we used as our exposures in the present study. We used the LifeCycle variable 'whether there is a green space >5,000 m² within 300 meters of the geocoded family home address' as our binary access to green space exposure. This definition is frequently used by policy makers to define 'residential proximity to a major green space', and is approximately a 5 minute walk (https://www.euro.who.int/__data/assets/pdf_file/0010/342289/Urban-Green-Spaces_EN_WHO_web3.pdf). The European 'Urban Atlas' database, produced by the European Environmental Protection Agency (2010) and based on green space data from 2006, was used to determine the location of green spaces, their size, and their distance from the family home (https://www.eea.europa.eu/data-and-maps/data/urban-atlas). The Barcelona Institute for Global Health (ISGlobal) linked the Urban Atlas green space layers with family geocoded addresses in a centralised Geographic Information System (GIS) environment to generate the green space exposure variable and harmonise it across cohort studies (de Castro Pascual et al. 2021). The ALSPAC Data team then received and prepared the data for release. In this study, we used three measures of this binary 'access to green space' exposure: in pregnancy (time 1), when the child was 4 years of age (time 2), and when the child was 7 years of age (time 3).

SEP Confounder/Interaction:
We used highest parental education as our SEP exposure, with A-levels (optional qualifications taken at approximately age 18 years) or higher coded as 'high' and O-levels (qualifications taken at approximately age 16 years) or lower coded as 'low'. This variable was measured in pregnancy.
Other confounders/covariates: Based on our hypothesised DAG (figure S2), the only factor that may confound these associations is ethnicity, so this variable was included in all analyses (coded as White vs other than White). We also included the sex of the child and the age of the child at the time of outcome measurement; these covariates are not necessarily confounders, but may help predict the outcomes and remove residual variation.

Analysis
We followed the same SLCMA detailed above for the simulated data, analysing the data using the same three methods: visual inspection, a 'relaxed lasso' approach (using both AIC and BIC to identify the best-fitting models), and cross-validated lasso (using both the minimum prediction error and within 1 standard error of this value to select the best-fitting models). Each set of SLCMA models was repeated for each of the four outcomes, using a linear model for continuous outcomes (BMI, SBP and DBP) and a logistic model for binary outcomes (overweight/obesity). As all exposures were binary, hypotheses were encoded as described in table 2. All exposures were mean-centered to reduce the correlation between each exposure and its interaction term and thereby increase power. To avoid collinearity between encoded variables, any variable with a correlation coefficient greater than r = 0.9 with another variable was removed from the set of hypotheses to be tested. In all models this resulted in the 'accumulation' hypothesis (and hence also the 'accumulation interaction' term) being omitted due to collinearity with the critical period variables.
Simple Change in green space from 2 to 3 with SEP interaction green_ch23 + green_ch23_int Compound
Table S2: Summary of overall simulation results for each SLCMA method, each interaction strength, and each life course interaction model, split by continuous and binary outcomes (n simulations = 16; n iterations per simulation = 1,000). Results show the percent of simulations in which the correct interaction term was selected. Note that in the rows for interaction strength of 'None', technically this is not the 'correct' interaction as no interaction was present (i.e., 8.4% of simulations using the AIC relaxed lasso method with a continuous outcome included the interaction term for the critical period at time 3 model, even though no interaction was simulated).
For full details on the simulation parameters, see section S2 of the supplementary information. Note that models with binary outcomes and an interaction strength of 'none' were more likely to include the interaction term compared to an interaction strength of 'very small'; this is because the dichotomisation of the continuous BMI outcome made the interaction effect slightly negative, compared to the original interaction term with the continuous outcome (e.g., where the true interaction is null for the continuous outcome it is slightly negative for the binary outcome, while where the true interaction is very small for the continuous outcome, it is practically null for the binary outcome). SE = Standard error; MSE = Mean-squared error.
Perhaps weak evidence for a decrease in green space between pregnancy and age 4 associated with outcome (figure S7)
One encoded variable included beyond baseline confounders/covariates (decrease in green space between pregnancy and age 4) b
No variables beyond baseline confounders/covariates
No variables beyond baseline confounders/covariates
One encoded variable included beyond baseline confounders/covariates (decrease in green space between pregnancy and age 4) b
a Results not displayed here, but they are very similar to the relaxed AIC results for systolic blood pressure described in table S8 (i.e., of the encoded variables, only a decrease in green space between pregnancy and age 4, and its interaction with socioeconomic position, were associated with the outcome).
b Post-selection inference suggested that a decrease in green space between pregnancy and age 4 was associated with a reduction in diastolic blood pressure (b = -1.209, SE = 0.443, 95% confidence interval = -2.084 to -0.285, p = 0.006).
Table S8: Assessing the ALSPAC result in detail where the relaxed AIC lasso for systolic blood pressure outcome found 5 encoded variables (beyond the baseline confounders/covariates) in the best-fitting model. These encoded variables were: 'int3' (interaction between SEP and critical period at age 7); 'green_dec12' (decrease in green space between pregnancy and age 4); 'green_dec12_int' (interaction between SEP and decrease in green space between pregnancy and age 4); 'green_inc_23' (interaction between SEP and increase in green space between ages 4 and 7); and 'green_dec23' (decrease in green space between age 4 and 7). To explore these results in more detail, we performed post-selection inference to estimate unbiased confidence intervals and p-values for this model (for more details on this method, see Tibshirani et al. 2016; Smith et al. 2022). As some of these selected encoded variables were interaction terms without a main effect, we compared results both excluding and including these additional main effects. These results suggest that the only robust effect is that of decrease in green space from pregnancy to age 4 and its interaction with SEP.
These models indicate that a decrease in access to green space between pregnancy and age 4 is associated with lower systolic blood pressure, but only among those from low SEP backgrounds (systolic blood pressure was approximately equivalent for those who did not experience a decrease in access to green space over this period and for those who did experience a decrease but came from high SEP backgrounds). This association is counter-intuitive, given that a decrease in access to green space may be expected to worsen, not improve, blood pressure. Given this, the complexity of the model, the minimal improvement in model fit (figure S6), the fact that other SLCMA models with this outcome did not detect this association (e.g., the relaxed BIC and cross-validated 1SE models), and the lack of similar effects in other literature, this result may be largely due to random noise in the data rather than a meaningful biological effect. SE = Standard error; CI = Confidence interval.

Variable
Without additional main effects (i.e., the model selected by the lasso)
Figure S1: Example of the output from the cross-validated lasso using the simulated example dataset. The log-lambda value is displayed on the x-axis, with decreasing lambda values going from right to left. The mean-squared error (y-axis) is a measure of out-of-sample prediction error, based on splitting the data into k equal portions (here, k = 10) and calculating the mean prediction error for each lambda value. The left dotted vertical line denotes the lambda value which minimises the prediction error, while the right dotted vertical line is the model where the prediction error is no more than one standard error above the minimum mean-squared error value. The number of parameters included in each model is given above the plot. In the one standard error model, there are three variables included: the SEP variable ('high_sep'; included by default), the most recent critical period variable ('crit3'), and the interaction between SEP and the most recent critical period ('int3'), just as we simulated. The model with the lowest prediction error contains these three parameters, plus nine additional encoded hypotheses.
Figure S2: Hypothesised Directed Acyclic Graph (DAG) for the ALSPAC data. Note that for simplicity only one 'green space' node, representing all three measured time-points, has been included here. We are also assuming that there may be a potential interaction between SEP and access to green space on cardiometabolic outcomes, which is not explicitly represented in this DAG. SEP = Socioeconomic position.
Figure S3: Sankey plot displaying the change in binary access to green space (>5,000 m² within 300 meters of home address) at each time-point (n = 11,799).
Figure S4: Plot of the lasso model with binary access to green space >5,000 m² within 300 m of home in pregnancy (time 1) and at age 7 (time 2) as the exposures, BMI at age 7 as the continuous outcome, and highest parental education as the SEP-interaction term (n = 6,013). As the variables are added to/removed from the model they appear on the x-axis (with "(+)" indicating addition, and "(-)" indicating removal). The covariates/confounders constrained to be included in all models by default do not appear in this plot (SEP, maternal ethnicity, child sex and child age at outcome measurement). Unlike figure 4 in the main text, this model only includes exposures from pregnancy and age 7 (i.e., excluding age 4); results are qualitatively identical, however (i.e., little/no association between access to green space and child BMI; even though 'int2' [interaction between SEP and critical period at age 7] was entered before all other variables, the improvement in model fit was minimal, with a ~0.02 percentage point improvement in the deviance ratio before the next variables were added). This result was confirmed by the relaxed lasso (AIC and BIC) and cross-validated lasso (minimum MSE and 1SE of MSE), which all selected the covariate-only model as the best fit to the data. Due to collinearity with the critical period variables, the 'accumulation' hypothesis (and hence also its interaction with SEP) was dropped from this model. See figure 1 for a detailed explanation of how to interpret this plot, and table 2 for information on what each of the encoded variables means.
Figure S5: Plot of the lasso model with binary access to green space >5,000 m² within 300 m of home in pregnancy (time 1), at age 4 (time 2) and at age 7 (time 3) as the exposures, overweight/obese at age 7 as the binary outcome, and highest parental education as the SEP-interaction term (n = 6,013).
As the variables are added to/removed from the model they appear on the x-axis (with "(+)" indicating addition, and "(-)" indicating removal). The covariates/confounders constrained to be included in all models by default do not appear in this plot (SEP, maternal ethnicity, child sex and child age at outcome measurement). There appears to be little association between any of the green space hypotheses and the outcome, with model fit (deviance ratio) barely increasing as variables enter the model (see table S7 for a more formal test using relaxed and cross-validated lasso approaches). Due to collinearity with the critical period variables, the 'accumulation' hypothesis (and hence also its interaction with SEP) was dropped from this model. See figure 1 for a detailed explanation of how to interpret this plot, and table 2 for information on what each of the encoded variables means.
Figure S6: Plot of the lasso model with binary access to green space >5,000 m² within 300 m of home in pregnancy (time 1), at age 4 (time 2) and at age 7 (time 3) as the exposures, systolic blood pressure at age 7 as the continuous outcome, and highest parental education as the SEP-interaction term (n = 5,918). As the variables are added to/removed from the model they appear on the x-axis (with "(+)" indicating addition, and "(-)" indicating removal). The covariates/confounders constrained to be included in all models by default do not appear in this plot (SEP, maternal ethnicity, child sex and child age at outcome measurement). There appears to be little association between any of the green space hypotheses and the outcome, with model fit (deviance ratio) barely increasing as variables enter the model (see table S7 for a more formal test using relaxed and cross-validated lasso approaches). Due to collinearity with the critical period variables, the 'accumulation' hypothesis (and hence also its interaction with SEP) was dropped from this model.
See figure 1 for a detailed explanation of how to interpret this plot, and table 2 for information on what each of the encoded variables means.
Figure S7: Plot of the lasso model with binary access to green space >5,000 m² within 300 m of home in pregnancy (time 1), at age 4 (time 2) and at age 7 (time 3) as the exposures, diastolic blood pressure at age 7 as the continuous outcome, and highest parental education as the SEP-interaction term (n = 5,918). As the variables are added to/removed from the model they appear on the x-axis (with "(+)" indicating addition, and "(-)" indicating removal). The covariates/confounders constrained to be included in all models by default do not appear in this plot (SEP, maternal ethnicity, child sex and child age at outcome measurement). There appears to be little association between any of the green space hypotheses and the outcome, with model fit (deviance ratio) barely increasing as variables enter the model, although as 'green_dec12' (a decrease in green space between pregnancy and age 4) was added before all other variables, perhaps there is weak evidence that this is associated with the outcome (see table S7 for a more formal test using relaxed and cross-validated lasso approaches). Due to collinearity with the critical period variables, the 'accumulation' hypothesis (and hence also its interaction with SEP) was dropped from this model. See figure 1 for a detailed explanation of how to interpret this plot, and table 2 for information on what each of the encoded variables means.