Dynamic Prediction of Survival in Cystic Fibrosis

Supplemental Digital Content is available in the text.

eAppendix 1. Creation of landmark data sets eFigure 1 illustrates how the landmark data sets arose. An individual was included in the landmark data set at age if they met all of the following criteria: They reached age between 1 st January 2005 and 31 st December 2015.
They joined the Registry prior to reaching age . The date of joining the Registry is the date of the first annual review at which data were obtained. They were diagnosed with CF prior to reaching age . They have not received an organ transplant of any type prior to reaching age . They have measures of all time-dependent variables recorded prior to reaching age .
We refer to an individual as "eligible for the th landmark data set" if she/he satisfied these five conditions. eTable 1 summarises the landmark data sets in terms of number of individuals, number of deaths within 2, 5 and 10 years of the landmark age, and number of censorings.

Time scale and follow-up
In all models the time origin is date of birth and analyses are performed using left-truncation at the landmark age. The censoring time was the earliest of death, 31 st December 2015 and a specified time horizon . Since dates of birth and death were only available in month/year format, the day was imputed as the 15 th of the month. For example, an individual aged 18 on 1 st January 2005 (who has been diagnosed, joined the Registry, and not received a transplant) contributes up to 11 years of follow-up until the end of 2015 to the landmark data set for age 18 and up to 10 years of follow-up for the landmark dataset for age 19 (if they do not die or have a transplant between ages 18 and 19), and so on. An individual aged 18 on 1 st January 2014 contributes up to 2 years of follow-up to the landmark data set for age 18 and up to 1 year of follow-up for the landmark dataset for age 19. The UK CF Registry aims to capture deaths from all causes. Of the 931 deaths used in this study, 775 (83.2%) were due to respiratory or cardiorespiratory failure, 55 (5.9%) were transplantation-related, 13 (1.4%) were due to liver disease or failure, 9 (1.0%) were due to cancer, 9 (1.0%) were due to trauma or suicide, 34 (3.7%) were due to "other causes" (recorded in a separate field and including "End state cystic fibrosis" and "Haemoptysis"), 35 (3.9%) were due to an unknown cause, and for 1 individual the cause was not recorded.
We assumed that all deaths are captured and the main results presented assume censoring is entirely administrative. In a sensitivity analysis we treated individuals not recorded at an annual follow-up for over 2 years as lost-to-follow-up. This did not materially alter the resultsthe C-indexes for 2-5-and 10-year survival from the final model (Model 2) were 0.874, 0.847, 0.807 respectively, and corresponding Brier scores were 0.036, 0.075, 0.130.

Landmark survival models
We let denote the vector of baseline predictors (sex, genotype and age of diagnosis) and denote the vector of the last-observation-carried-forward (LOCF) values for time-dependent predictors (calendar year, FEV%, FEV%, weight, height, CFRD. pancreatic insufficiency, Pseudomonas aeruginosa, Burkholderia cepacia, Staphylococcus aureus, Methicillin-resistant Staphylococcus aureus (MRSA), non-IV hospitalization, number of IV days) at landmark age .

Model 1 for the log conditional hazard is
where is the baseline hazard at age conditional on eligibility for the th landmark data set, and and are vectors of log hazard ratios specific to landmark age . Model 1 is in fact models, which are fitted in each landmark data set .
Model 2 for the log conditional hazard is where is again the baseline hazard at age conditional on eligibility for the th landmark data set . and are vectors of log hazard ratios, which are assumed to be the same for all . Model 2 therefore allows a separate baseline hazard from each landmark age, but common predictor coefficients across all landmark ages. It is fitted in the stacked data set using Cox regression with a stratified baseline hazard. 1,2 We note that for Models 1 and 2, using age as the time scale or timesince-landmark as the timescale are exactly equivalent. Models 1 and 2 make the proportional hazards assumption that the association of the predictors and with the hazard is the same over time since , i.e. that the and parameters are not timedependent. Models 1 and 2 were initially fitted using a time horizon of 10 years ( ), which enables us to obtain predicted survival probabilities for any time up to 10 years. We also investigated whether 2-year and 5-year survival could be better predicted by using a shorter time horizon by fitting Models 1 and 2 using and respectively.
Model 3 extends Model 2 by allowing the log hazard ratios to depend on in a smooth way: where and denote vectors of log hazard ratios that are functions of . We considered linear forms and and restricted cubic spline forms with knots at 18, 30, 40 and 50. The results reported in Table 3 of the main text are from the analysis using  the linear form for , as using restricted cubic splines did not materially improve predictive performance.
In Model 4 the supermodel was extended to allow time-varying coefficients, with the association between the predictors and the hazard dependent on time-since landmark : where and denote vectors of log hazard ratios that are functions of . We considered linear forms and and restricted cubic spline forms with knots at . The results reported in Table 3 of the main text are from the analysis using the linear form for , as using restricted cubic splines did not materially improve predictive performance.
Model 5 uses an overall baseline hazard instead of separate baseline hazards for each landmark age, with the impact of landmark age modelled using regression terms: where is a common baseline hazard and is a function of landmark age. We used a restricted cubic spline form for with knots at 18, 30, 40 and 50.
In Model 6 we extended Model 2 by adding the fitted values and slopes from the multivariate mixed model (see below) for FEV%, FVC% and weight to the set of time-dependent predictors at each landmark age: where denotes the vector of predicted values and slopes for FEV%, FVC% and weight from the multivariate mixed model. All models were fitted by maximum partial likelihood.

Multivariate mixed model
A multivariate linear mixed model for FEV1%, FVC%, BMI and weight was fitted to the repeated measures up to landmark age for individuals in the landmark data set at age . Separate models were fitted for each landmark age. The longitudinal variables were modelled as a linear function of age with a random intercept and slope. We also included fixed effects of all the other predictors, including both baseline and time-dependent predictors. For each individual in landmark dataset ( ) the individual fitted values and slopes for FEV1%, FVC% and weight at age were obtained. The numbers of longitudinal measurements used in the multivariate mixed models are summarised in eTable 2.

Predicted survival probabilities
From each model the predicted survival probability to time after the landmark age, conditional on survival to the landmark age, on baseline variables and on values of time-dependent predictors at the landmark age , , was obtained using the relationship For models without time-varying hazard ratios (Models 1-3 and 5-6) we used the estimator: where denotes the baseline hazard at time estimated from the increments in Breslow's estimate of the cumulative baseline hazard and the sum is over event times. 3 For Model 4, which has timevarying hazard ratios, we used the estimator eAppendix 3. Model assessment

Overview
Models were assessed and compared based on the "3-in-1" procedure described by Yong et al (2013), which incorporates model building using cross-validation, final model choice, and statistical inference. 4 The data were first divided into a "training+validation" (TV) set and a "holdout" set. The TV set is used in the model development and assessment. The holdout set is reserved for applying the selected model at the end. No models are fitted using the holdout data. The TV set is a sample of 80% from the stacked data, stratified by landmark age. The holdout set is formed from the remaining 20% of individuals at each landmark age. Some individuals appear in both the TV and holdout stacked data sets, but not with the same landmark age.
For model assessment we used the C-index, 5-8 the Brier score, 9,10 and percentage reduction in the Brier score relative to the null model (i.e. the model excluding all predictors, using Kaplan-Meier estimates). 11 The C-index and Brier scores were obtained using inverse probability of censoring weights. For Model 4 we accommodated the time-varying coefficients into the estimation of the C-Index and Brier score. 8 A Monte-Carlo cross-validation procedure was used within the TV data set to avoid over-optimism due to overfitting 12 . The procedure was as follows: (i) An 80% stratified random sample, with stratification by landmark age , was obtained from the TV data set.
(ii) The model was fitted on the 80% sample. (iii) The fitted model was used to obtain predicted survival probabilities to a given time from each landmark age (see below) for the 20% not in the sample. (iv) Model performance measures (C-index, Brier score, and percentage reduction in the Brier score) were obtained in the 20% not in the sample on which the model was fitted.
(v) Steps (i)-(iv) were repeated 200 times and we obtained the average C-index, Brier score and Brier score reduction across the 200 samples.
Model assessment measures were obtained for 2-year, 5-year and 10-year survival from each landmark age. Therefore there are 99 averaged C-indices and Brier scores for each model ( , where 33 is the number of landmark ages 18-50). For each model we also obtained an overall C-index and Brier score which are not age-adjusted. Further details are given below. To simplify the notation we give the details of the C-index and Brier score as if applied to the complete stacked data (the TV and holdout data combined).

Truncated C-Index
The following description of the C-index follows that of Gerds et al.. 7 Let and denote respectively the event time and censoring time for individual . We observe and the event indicator . Let denote the estimated probability of survival beyond age conditional on survival to age and given predictor values at age . The truncated C-index is where the expectation is with respect to two subjects , both alive at age . Not all pairs of individuals are comparable. We can compare two individuals who both have the event prior to age ; two individuals, one of whom has the event prior to age and the other of which is known to be alive (censored) at age . We cannot compare two individuals who are both known to be alive (censored) at age , two individuals both censored before age , or a pair in which one individual has the event and the other is censored before the other's event time. The fact that not all pairs of individuals can be compared is handled using inverse probability of censoring weights (IPCW). The truncated C-Index can be expressed as We assume that the event and censoring time are independent conditional on the variables, i.e.
, and that the probability of being uncensored at the prediction horizon is bounded away from 0. This gives rise to the IPCW estimator of , where is a weight, where the censoring probabilities used in the weight are obtained from a model to be specified (see below).
The C-index is conditional on survival to age and a separate estimated C-index is obtained for any combination of and ( ). We also considered an overall C-index which is combined across landmark ages. Consider the stacked landmark data set and let denote the landmark age for record (row) Some individuals appear in more than one row in the stacked landmark data set and we define to be the unique identifier (ID number) for the individual in row . The overall C-index is where the expectation is with respect to two rows in the stacked landmark data set. Inclusion of the indicator ensures that an individual is not compared with herself/himself. An estimator incorporating censoring weights is where is the total number of individuals in the stacked landmark data set and the weights are .
We assumed that the probabilities in the weights do not depend on or and therefore used in place of s and in place of . The probabilities were estimated separately from each landmark age using Kaplan-Meier estimates. A similar approach was used for the weights .
In summary we obtained for and for and

Brier score
The Brier score is the mean squared prediction error. As for the C-index, we obtained separate Brier scores at each landmark age and an overall brier score. In the absence of censoring an estimator of the Brier score is where is the model-based estimated probability of survival to age for individual in the landmark data set at age , is the observed indicator of survival to age , and the sum is over the individuals in landmark data set ( ). An estimator incorporating inverse probability of censoring weights is where is the event indicator, is an indicator taking value 1 for individuals who have the event or whose censoring age is after , and zero otherwise, and is the probability of being censored beyond age . The inverse probability of censoring weights were obtained using Kaplan-Meier estimates stratified by landmark age.
The overall Brier score estimator is where the sum is over all rows in the stacked landmark data set and .
Brier scores were also obtained under a null model using Kaplan-Meier estimates of the survival probabilities stratified by landmark age but with no other predictors. These are denoted and . The percentage reduction in the Brier score from a given model compared with the null model was calculated using and .
In summary we obtained for and for and , and the corresponding percentages reductions in the Brier score relative to the null model.

Calibration plots
After selecting the final model, calibration plots were obtained to show graphically the agreement between predicted survival probabilities from the model and the 'true' probabilities. The steps for creating these plots were as follows: Steps (i)-(iii) are the same as described earlier, in the Overview section of eAppendix 3.

eAppendix 4. Software
All analyses were performed using R. The landmark models described in eAppendix 2 can be fitted easily using the coxph function from the survival package after some rearrangement of the data. 13 Some of the data rearrangement can be performed using the dynpred package, 14 for example using the cutLM function, though we did not use that here. Estimated survival probabilities can be obtained using 'predict' after coxph, though special code was written to obtain the predicted survival probabilities from Model 4, which included time-varying coefficients.
There exist various packages for obtaining C-indexes and Brier scores. None of the existing functions for estimating the C-index appear to accommodate a stratified baseline hazard, and so we used bespoke code. We used 'pew' from the dynpred package to estimate the Brier scores; this requires pre-estimation of matrices of predicted survival and censoring probabilities.
The multivariate mixed model used to obtain the additional predictors for Model 6 was fitted using the lme function from the nlme package. 15 Existing software, including the nlme package, does not appear to allow out-of-sample predictions from mixed models. We therefore used bespoke code which is available from https://github.com/ruthkeogh/landmark_CF .

eAppendix 5. Final model specification
R code for obtaining estimated survival probabilities from the final model is provided at https://github.com/ruthkeogh/landmark_CF. This includes csv files containing estimated cumulative baseline hazards for each landmark age ( ).

eAppendix 6. Comparisons with other models
In an analysis of the French CF Registry Nkam et al reported a cross-validated C-statistic of 0.90 for prediction of 3-year survival. 16 They did not report a Brier score. Aside from focusing on 3-year survival and using different set of predictors, there are a number of differences between their approach and ours. They used a composite outcome of death and transplant, and for their logistic regression analysis, they excluded individuals who were censored before the end of the 3-year followup period.
Liou et al used a logistic regression analysis of the US CF Registry to predict 5-year survival. 17 A calibration plot showed good performance using a validation data set. However, they did not present measures of predictive performance that are comparable to those in this paper. Mayer-Hamblett et al also used a logistic regression analysis of the US CF Registry to develop a model for predicting 2-year survival. 18 They presented an ROC curve but did not report an area under the ROC curve, which could be compared to our C-Index. They presented sensitivities and specificities, and positive-and negative predictive values, finding that their model was better at predicting who would survive 2 years than who would die.
McCarthy et al developed the CF-ABLE score using logistic regression modelling of data from the CF population in Ireland. 19 Based on a validation data set, the area under the ROC curve was 0.82 for 4year survival, though it is not clear how censoring was treated. eFigure 1. Summary of data exclusions and creation of data set for analysis. eTable 1. Summary of number of individuals, deaths, censorings and total person time at risk in each landmark data set. The stacked data set is formed by combining the landmark data sets. b We assessed the impact on predictive performance of including two treatments that were included in the model of Nkam et al for the French Registry: use of oxygen therapy and use of non-invasive ventilation. 16 Nkam et al also investigated use of oral corticosteroids, but there was insufficient data on use of this treatment in the UK data. We created binary variables at each landmark age, which indicate whether an individual had ever used each treatment in the past. The adjusted hazard ratio associated with oxygen use was 1.75 (95% CI 1.50-2.05) and the adjusted hazard ratio associated with non-invasive ventilation is 1.15 (95% CI 0.92-1.43). Therefore both oxygen therapy and non-invasive ventilation are associated with an increased mortality hazard (though the association for non-invasive ventilation is not statistically significant), because these treatments are used by sicker patients. The estimates do not have a causal interpretation. (i) Characteristics of example individuals a in groups defined by 5-year survival probability.