Variational Temporal Deconfounder for Individualized Treatment Effect Estimation with Longitudinal Observational Data

Purpose This paper proposes a new approach, the Variational Temporal Deconfounder (VTD), for estimating individualized treatment effects (ITEs) from longitudinal observational data, where we address hidden confounding by using proxies (i.e., surrogate variables that stand in for unobservable variables). Methods We build the VTD by combining a variational recurrent autoencoder, which learns latent encodings of hidden confounders from observed proxies, with an ITE estimation network that takes the learned hidden encodings to predict the probability of receiving treatment and the potential outcomes. Results We test the VTD on both synthetic and real-world clinical data. The results from the synthetic data experiments demonstrate the VTD's effectiveness in deconfounding, outperforming existing methods, while the results from two real-world datasets (i.e., the Medical Information Mart for Intensive Care version III [MIMIC-III] and the National Alzheimer's Coordinating Center [NACC] database) suggest that the VTD model outperforms existing baseline models, although its performance varies depending on the assumptions about the underlying causal structure and the availability of proxies for hidden confounders. Conclusion The VTD offers a unique solution for addressing confounding bias without the "unconfoundedness" assumption when estimating ITEs from longitudinal observational data. Eliminating the requirement for the "unconfoundedness" assumption makes the VTD more versatile and practical for real-world clinical applications of personalized medicine.


Introduction
Estimating the treatment effect, i.e., the causal effect of a given treatment or intervention on an outcome, plays an important role in evidence-based medicine, providing quantified measurements of the benefits or harms of the treatment of interest, which help regulators make regulatory decisions, the health care community develop guidelines and decision support tools, and clinical professionals choose treatments in their clinical practice. Randomized controlled trials (RCTs) have been widely used as the gold standard to estimate average treatment effects (ATEs), measuring the difference in average outcomes between individuals in the treatment group and those in the control group. In a well-designed RCT, patients are randomly assigned to the control and treatment groups, such that the units in the treatment vs. control groups are identical across all known and unknown factors, reducing potential bias [1].
However, RCTs have limitations: they are not only time-consuming and logistically complex to conduct, but their results may also not generalize beyond the study population, especially to the real-world populations where the treatment will be applied [2]. In recent years, the rapid growth of electronic health record (EHR) systems has made large collections of longitudinal observational real-world data (RWD) available for research to generate real-world evidence (RWE) [3]. The U.S. Food and Drug Administration (FDA) provided guidance on using RWD such as EHRs and claims data to support regulatory decision-making [4] and has recently approved a new use for Prograf (tacrolimus), originally approved to prevent organ rejection in liver transplants and now approved for lung transplants, based on an observational study providing RWE of effectiveness [5]. Nevertheless, an observational, noninterventional study needs to be well designed, accounting for inherent biases in observational data such as confounding and selection bias. Further, moving beyond the ATE, there is also a strong desire to obtain individualized treatment effects (ITEs), considering the heterogeneity of the target patient population and their differential responses to the same treatment. In recent years, ITE estimation based on more accessible observational data such as EHRs has been a thriving research area to fill this gap [6].
One of the most critical issues in estimating ITEs from observational data is confounding, which occurs when variables can affect both the outcomes and the interventions [7]. These variables are called confounders. For example, socioeconomic status can affect both the medications a patient has access to and the patient's general health; socioeconomic status therefore acts as a confounder between medication and health outcomes. If confounders can be measured, the most common way to counter their effect is by "controlling" for them in the models [8]. Many approaches for estimating ITEs from observational data have been proposed along these lines, which can be categorized into 2 groups: (1) covariate adjustment [9][10][11][12], and (2) propensity score re-weighting [13][14][15]. Most of these approaches are built on the commonly used assumption of "unconfoundedness," under which all variables that affect both the interventions and the outcomes are observed and measured. However, unconfoundedness-based models will produce biased ITE estimates when certain confounders are hidden or unmeasured [16]. In reality, it is unlikely that we can observe and/or directly measure all confounders in real-world observational studies. For example, RWD such as EHRs often lack variables such as environmental factors or personal preferences, which are potential confounders. A possible way of modeling hidden confounders is through modeling their proxies (i.e., surrogate measures). For example, stigma is an important factor in clinical care and outcomes for a variety of conditions, from infectious diseases to mental health. However, a stigma questionnaire might not always be administered to people during healthcare encounters. Mental health assessments and other measurements, however, might be available and can be used as surrogates [17].
Several approaches have built deconfounding ITE estimation models by using proxy variables. The multiplicity of causes [18] and matrix factorization [19] have been used to infer the confounders from missing or proxy variables. More recently, the variational autoencoder (VAE) [20], a deep generative model with powerful hidden representation learning ability, has been applied to model hidden confounders [8,21,22] and has achieved superior performance. However, these variational generative model-based approaches are designed for cross-sectional settings and cannot be directly adopted in a longitudinal setting. In real-world clinical practice, EHRs contain rich time-dependent patient information such as lab results, vital signs, and medication use across patients' encounters with the health system. With such longitudinal data, we can answer essential questions such as what the optimal time to administer a treatment is, when a treatment regime should be stopped, or in which order treatments should be given to obtain the best treatment response [6]. Only a few attempts have built longitudinal ITE models [23,24], and none of them have tried to use the variational generative approach to model the hidden confounders over time.
In this paper, we propose the Variational Temporal Deconfounder (VTD), a novel method for ITE estimation that leverages the variational autoencoder to model hidden confounders in a longitudinal setting. Instead of assuming no unobserved hidden confounders, we create embeddings of latent variables to recover the distributions of hidden confounders from the proxies over the observational data space. Our approach is two-fold: (1) a transformer-based factor model that infers the latent random variables with a variational autoencoder, learning the hidden confounders from variations of the observed proxies while capturing the dependencies among the hidden confounders at neighboring timesteps; and (2) a timestep-wise variational lower bound combined with the prediction loss to integrate joint training of the latent factor model with the ITE estimation task. We highlight that our VTD, like the time series deconfounder [23], works as an unbiased ITE estimation approach requiring weaker assumptions than standard methods over observational data. To show the effectiveness of the VTD, we first conducted a simulation study to investigate the VTD's capability to infer latent variables where we explicitly created hidden confounders. Then we evaluated the VTD on two real-world datasets: (1) the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, with patients admitted to Intensive Care Units (ICUs), and (2) data from the National Alzheimer's Coordinating Center (NACC) database.

Problem formulation

Let x_t denote the observed covariates at time step t. We emphasize that the observed covariates x_t are proxies of hidden confounders. We denote the unobserved confounders for the proxies x_t as r-dimensional random variables z_t, where z_t ∈ R^r. Figure 1 shows the causal structure between the hidden confounders and the other variables.
We adopt the potential outcomes framework under the longitudinal setting proposed by Robins and Hernán [25], who extended it from the static setting of Neyman [26] and Rubin [27]. Let x_{1:t} denote the historical covariates collected up to time t, and a_{1:t} the history of treatment assignments. For each patient, given the observed covariates x_{1:t} and a treatment a_t, we want to estimate the individualized treatment effect (ITE), i.e., the potential outcome conditioned on the history:

E[ y_t(a_t) | x_{1:t}, a_{1:t-1} ].   (1)

We adopt two standard assumptions [28] for ITE estimation:

Assumption 1 Consistency. If a_{1:t} = ā_{1:t}, then the potential outcome for the treatment assignment ā_{1:t} is the same as the observed outcome, i.e., y_t(ā_{1:t}) = y_t.

Assumption 2 Positivity (Overlap). If P(a_{1:t-1} = ā_{1:t-1}, x_{1:t} = x̄_{1:t}) ≠ 0, then P(a_t | ā_{1:t-1}, x̄_{1:t}) > 0 for all a_t.
Beyond these two assumptions, most other ITE methods also assume unconfoundedness, or sequential ignorability, i.e.,

y_t(a_t) ⊥ a_t | x_{1:t}, a_{1:t-1}   (2)

for all a_t and t, which holds only if there are no hidden confounders. In our setting, we observe the proxies x_t instead of the hidden confounders z_t, so unconfoundedness is violated, and using standard methods will result in biased ITE estimation.
We address this by using the VTD, which learns a hidden embedding ẑ_t that reflects the true hidden confounders z_t from variations of the observed proxies and also captures the dependencies among the ẑ_t at neighboring timesteps.

The Variational Temporal Deconfounder model
We introduce our VTD model as follows: (1) the architecture of the VTD that consists of a variational recurrent autoencoder and an ITE block to produce the hidden embedding and ITE estimation, respectively; and (2) the variational bound of VTD, which ensures the embedding of the hidden confounders can be learned by standard gradient-based optimization.

The architecture of the VTD
The VTD consists of two main components: (1) a variational recurrent autoencoder, which learns the latent embedding ẑ_t of the hidden confounders from the observed proxies x_t; and (2) an ITE estimation block, which takes the learned hidden embedding ẑ_t to predict the probability of receiving treatment a_t and the potential outcome y_t. We illustrate the architecture of the VTD in Fig. 2.
(1) The variational recurrent autoencoder. The variational recurrent autoencoder uses a recurrent encoder-decoder framework in which a transformer is introduced to adjust for the time-varying structure of the longitudinal setting shown in Fig. 1. The encoder maps the input proxies x_t from the observed space to the latent space of the hidden confounders. In the encoder, a transformer [29] takes a sequence of observed proxies x_{1:t} and outputs the hidden states h_t accordingly, followed by a fully-connected layer that maps the hidden state h_t of each time step onto the latent embedding ẑ_t, i.e.,

h_t = Transformer(x_{1:t}),  ẑ_t = FC(h_t).   (3)

Further, when implementing the transformer, we also consider the elapsed time Δt between a patient's two consecutive encounters (with the health system) in order to take into account the time-varying effect of clinical events (e.g., a bone fracture that happened a year ago would have a different effect on the patient's current health status compared with a bone fracture that happened a week ago). Thus, we add the elapsed times along with the input and define the generation of the hidden state as

h_t = Transformer(x_{1:t}, Δ_{1:t}).

(2) The ITE estimation block. Leveraging the learned hidden embedding ẑ_t as the representation of the hidden confounders, we estimate the ITE by incorporating two prediction tasks: (i) the probability of receiving treatment a_t, and (ii) the outcome y_t.
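As an illustration, the encoder's per-timestep mapping from a transformer hidden state to the parameters of a Gaussian latent embedding, followed by the reparameterization trick, can be sketched as follows. This is a minimal NumPy sketch with toy dimensions; all names and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_step(h_t, W_mu, b_mu, W_sig, b_sig):
    """Map a transformer hidden state h_t to the mean and standard
    deviation of the approximate posterior over the latent z_t."""
    mu = h_t @ W_mu + b_mu
    sigma = np.log1p(np.exp(h_t @ W_sig + b_sig))  # softplus keeps sigma > 0
    return mu, sigma

def reparameterize(mu, sigma, rng):
    """Sample z_t = mu + sigma * eps with eps ~ N(0, I), so that during
    training gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Toy dimensions: hidden size 8, latent size 3, sequence length 5.
d_h, d_z, T = 8, 3, 5
W_mu, b_mu = rng.standard_normal((d_h, d_z)), np.zeros(d_z)
W_sig, b_sig = rng.standard_normal((d_h, d_z)), np.zeros(d_z)

# Pretend these are the transformer's outputs for one patient.
h = rng.standard_normal((T, d_h))
z = np.stack([reparameterize(*encode_step(h[t], W_mu, b_mu, W_sig, b_sig), rng)
              for t in range(T)])
print(z.shape)  # one latent embedding per time step: (5, 3)
```

In a real implementation the same mapping would be expressed in a deep learning framework so the reparameterized sample stays differentiable end to end.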
We use a fully-connected layer that takes the embedding ẑ_t to predict the probability of receiving treatment a_t, i.e.,

p̂(a_t) = FC_a(ẑ_t).   (6)

We also use a fully-connected layer to predict the outcome y_t, which takes the hidden embedding ẑ_t together with the assigned treatment a_t, i.e.,

ŷ_t = FC_y(ẑ_t, a_t).   (7)

Then we compute the weights using inverse probability of treatment weighting (IPTW) and extend them to a dynamic setting as

w_t = a_t / p̂(a_t) + (1 - a_t) / (1 - p̂(a_t)),

where p̂(a_t) denotes the probability of being in the treated group. By incorporating these weights with our outcome prediction, we define the supervised loss as

L_sup = Σ_t w_t (y_t - ŷ_t)^2.
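To make the weighting concrete, here is a minimal NumPy sketch of binary IPTW weights and an IPTW-weighted outcome loss. The function names and the exact loss form are illustrative assumptions, not the paper's code.

```python
import numpy as np

def iptw_weights(a, p_treat, eps=1e-6):
    """Inverse probability of treatment weights for binary treatments.
    a: observed treatments (0/1); p_treat: predicted P(a = 1 | history).
    Probabilities are clipped away from 0 and 1 to keep weights finite."""
    p = np.clip(p_treat, eps, 1 - eps)
    return a / p + (1 - a) / (1 - p)

def weighted_outcome_loss(y_true, y_pred, w):
    """IPTW-weighted mean squared error for the outcome head."""
    return np.mean(w * (y_true - y_pred) ** 2)

a = np.array([1, 0, 1, 0])
p_treat = np.array([0.8, 0.3, 0.5, 0.5])
w = iptw_weights(a, p_treat)  # w ~ [1.25, 1.43, 2.0, 2.0]
loss = weighted_outcome_loss(np.array([1.0, 0.5, 0.2, 0.0]),
                             np.array([0.9, 0.4, 0.3, 0.1]), w)
```

Treated patients with a low predicted treatment probability (and controls with a high one) get up-weighted, which is what corrects the treatment-assignment bias in the outcome loss.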

The variational bound of VTD
The VAE was proposed to model complex multimodal distributions of hidden factors over the space of the observed dataset. We define the joint distribution of the observed proxies x_{1:T} and latent confounders z_{1:T} over T time steps as

p(x_{1:T}, z_{1:T}) = Π_{t=1}^{T} p(x_t | z_{1:t}, x_{1:t-1}) p(z_t | z_{1:t-1}, x_{1:t-1}).

In the standard VAE, the latent random variable follows a standard Gaussian distribution. To reflect the causal structure in Fig. 1, we instead assume z_t follows a prior Gaussian distribution

z_t ~ N(μ_{0,t}, diag(σ_{0,t}^2)),  [μ_{0,t}, σ_{0,t}] = φ_prior(h_{t-1}),

where φ_prior is a function that maps the hidden states to the parameter space of μ_{0,t} and σ_{0,t}. We also assume the generating distribution p(x_t | z_{1:t}, x_{1:t-1}) follows a Gaussian distribution. Our goal is then to infer the posterior p(z_{1:T} | x_{1:T}). Following the paradigm in [30,31], we introduce the variational distribution q(z_t | x_{1:t}) and transfer the inference problem into maximizing

L_ELBO = Σ_{t=1}^{T} E_q[ log p(x_t | z_{1:t}, x_{1:t-1}) ] - KL( q(z_t | x_{1:t}) || p(z_t | z_{1:t-1}, x_{1:t-1}) ),

where L_ELBO denotes the marginal likelihood lower bound (ELBO) of the full dataset. We incorporate the supervised loss L_sup of the ITE estimation with L_ELBO to define the overall loss as

L = L_sup - L_ELBO.
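The timestep-wise bound combines, at each step, a reconstruction term with a KL divergence between the approximate posterior and the learned prior. For diagonal Gaussians both terms have closed forms; the sketch below (illustrative names, plain NumPy instead of a deep learning framework) computes a negative ELBO for one sequence.

```python
import numpy as np

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, diag sig_q^2) || N(mu_p, diag sig_p^2) ), summed over dims."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

def gauss_log_lik(x, mu, sig):
    """Log-likelihood of x under a diagonal Gaussian decoder."""
    return -0.5 * np.sum(np.log(2 * np.pi * sig**2) + (x - mu)**2 / sig**2)

def neg_elbo(xs, recon, post, prior):
    """Negative timestep-wise ELBO: sum over t of KL(q_t || p_t) minus
    the reconstruction log-likelihood. recon, post, and prior are lists
    of (mu, sigma) pairs, one per time step."""
    total = 0.0
    for t, x_t in enumerate(xs):
        total += kl_diag_gauss(*post[t], *prior[t])
        total -= gauss_log_lik(x_t, *recon[t])
    return total

# One-step toy check: posterior equals prior, decoder centered on x.
std = (np.zeros(2), np.ones(2))
loss_val = neg_elbo([np.zeros(2)], recon=[std], post=[std], prior=[std])
```

When the posterior matches the prior the KL term vanishes, so the loss reduces to the (negative) reconstruction log-likelihood alone.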

Experiments
We demonstrate the effectiveness of the VTD in experiments using a synthetic dataset, the MIMIC-III dataset, and the NACC dataset, and show that the VTD reduces confounding bias in ITE estimation based on empirical observations from these experiments. We compared the VTD with the following causal inference approaches: (1) G-formula, a generalization of the standard regression model to the longitudinal setting that can be used to adjust for time-varying confounders [32]; (2) Deep Sequential Weighting (DSW), which infers the hidden confounders by incorporating the current treatment assignments and historical information using a deep recurrent weighting neural network [24]; and (3) Time Series Deconfounder (TSD), which leverages the assignment of multiple treatments over time to enable the estimation of treatment effects in the presence of multi-cause hidden confounders [23].
We report the Root Mean Square Error (RMSE) between the predicted and ground-truth outcomes to measure the models' performance on conventional prediction tasks. To evaluate ITE estimation, the most common measurement is the Precision in Estimation of Heterogeneous Effect (PEHE) [33], defined as the mean squared error between the ground-truth and estimated ITEs, i.e.,

PEHE = (1/N) Σ_{i=1}^{N} ( (y_i(1) - y_i(0)) - (ŷ_i(1) - ŷ_i(0)) )^2.

However, in real-world datasets the counterfactual is never observed; thus, we use the influence function-based PEHE (IF-PEHE), which approximates the true PEHE by "derivatives" of the PEHE function [34].
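For reference, the two metrics on synthetic data (where both potential outcomes are known) can be computed as follows; the variable names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error for the factual outcome predictions."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def pehe(y1_true, y0_true, y1_pred, y0_pred):
    """Precision in Estimation of Heterogeneous Effect: the mean squared
    error between true and estimated individualized treatment effects.
    It needs both potential outcomes per unit, so it is only computable
    when counterfactuals are known (e.g., on synthetic data)."""
    ite_true = np.asarray(y1_true) - np.asarray(y0_true)
    ite_pred = np.asarray(y1_pred) - np.asarray(y0_pred)
    return np.mean((ite_true - ite_pred) ** 2)

# True ITEs are [1, 2]; estimated ITEs are [1, 1] -> PEHE = 0.5.
print(pehe([2.0, 3.0], [1.0, 1.0], [2.0, 2.0], [1.0, 1.0]))  # 0.5
```

On real-world data only the factual RMSE is directly computable, which is why a surrogate such as IF-PEHE is needed to assess ITE quality.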

The synthetic data
In the problem formulation section above, we introduced that the treatment assignments a_t at each time step are determined by confounders, which include the previous hidden confounders z_{t-1}, the current time-varying covariates x_t, and static features. The covariates x_t and hidden confounders z_t are generated for each patient at a given time through an autoregressive process that takes into account historical information as well as the influence of previous treatment assignments, so we define the following equations to generate the covariates and hidden confounders:

x_t = f_x(x_{1:t-1}, a_{1:t-1}) + ε_t,  z_t = f_z(z_{1:t-1}, a_{1:t-1}) + η_t.

The treatment assignments a_t and outcome y_t at each time stamp are generated using the hidden confounders z_t and current covariates x_t as follows:

a_t ~ Bernoulli( σ( (1 - γ) w_a x_t + γ v_a z_t + b_a ) ),
y_t = (1 - γ) w_y x_t + γ v_y z_t + b_y,

where the influence of the hidden confounders is controlled by a confounding factor γ, and the w, v, and b terms are the weights and biases of a linear model. The functions f_x and f_z map the concatenated feature vectors into the hidden space. For this study, we used a confounding factor γ = 0.1, 100 covariates, and 10 time steps when generating the samples.
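A minimal NumPy sketch of this kind of generator is shown below. The exact functional forms here (tanh transitions, a logistic treatment model, a linear outcome, a toy latent dimension of 5) are illustrative stand-ins, not the paper's exact equations; only the overall structure, with γ mixing the influence of covariates and hidden confounders, follows the description above.

```python
import numpy as np

def simulate_patient(T=10, d_x=100, d_z=5, gamma=0.1, seed=0):
    """Toy autoregressive generator: covariates x_t and hidden confounders
    z_t depend on their previous values and the previous treatment, while
    treatments and outcomes depend on both x_t and z_t, with gamma
    controlling the strength of hidden confounding."""
    rng = np.random.default_rng(seed)
    Wx = rng.normal(scale=0.1, size=(d_x, d_x))
    Wz = rng.normal(scale=0.1, size=(d_z, d_z))
    wa_x, wa_z = rng.normal(size=d_x), rng.normal(size=d_z)
    wy_x, wy_z = rng.normal(size=d_x), rng.normal(size=d_z)
    x, z = np.zeros((T, d_x)), np.zeros((T, d_z))
    a, y = np.zeros(T, dtype=int), np.zeros(T)
    x_prev, z_prev, a_prev = rng.normal(size=d_x), rng.normal(size=d_z), 0
    for t in range(T):
        # Autoregressive transitions influenced by the previous treatment.
        x[t] = np.tanh(Wx @ x_prev) + 0.5 * a_prev + rng.normal(scale=0.1, size=d_x)
        z[t] = np.tanh(Wz @ z_prev) + 0.5 * a_prev + rng.normal(scale=0.1, size=d_z)
        # Treatment and outcome mix covariates and hidden confounders via gamma.
        logit = (1 - gamma) * (wa_x @ x[t]) / d_x + gamma * (wa_z @ z[t]) / d_z
        a[t] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
        y[t] = (1 - gamma) * (wy_x @ x[t]) / d_x + gamma * (wy_z @ z[t]) / d_z + a[t]
        x_prev, z_prev, a_prev = x[t], z[t], a[t]
    return x, z, a, y

x, z, a, y = simulate_patient()
```

Sweeping gamma in such a generator is what lets an experiment vary the strength of hidden confounding while holding everything else fixed.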

The MIMIC-III dataset
Following a setting similar to that of Bica et al. [23], we constructed a dataset based on the Medical Information Mart for Intensive Care version III (MIMIC-III) [35]. The MIMIC-III dataset contains more than 61,000 ICU admissions from 2001 to 2012, with recorded patient demographics and temporal information, including vital signs, lab tests, and treatment decisions. We extracted 11,715 adult sepsis patients fulfilling the Sepsis-3 criteria [36] as our study cohort from MIMIC-III.
Here, we obtain 27 time-varying variables (vital signs: temperature, heart rate, systolic blood pressure, mean blood pressure (MBP), diastolic blood pressure, respiratory rate, oxygen saturation (SpO2); lab tests: sodium, chloride, magnesium, glucose, blood urea nitrogen, creatinine, urine output, Glasgow Coma Scale, white blood cell count, bands, C-reactive protein, hemoglobin, hematocrit, anion gap, platelet count, partial thromboplastin time, prothrombin time, international normalized ratio, bicarbonate, lactate) and 8 static demographic variables (age, gender, race, metastatic cancer, diabetes, height, weight, body mass index). We designed two causal inference tasks considering two available treatment assignments: vasopressors and mechanical ventilation (MV). For each treatment option, we separately evaluate its causal effect on an important outcome of interest: for vasopressors, we adopted MBP as the target outcome, and for mechanical ventilation, we adopted SpO2. We consider the rest of the variables as the observed covariates.

The NACC dataset
Following a similar process, we constructed the longitudinal data from the National Alzheimer's Coordinating Center (NACC) Uniform Data Set (UDS) [37]. The NACC-UDS is a database that has, since 2005, collected demographic, clinical, diagnostic, and neuropsychological data from 29 Alzheimer's Disease Centers (ADCs) on recruited participants with normal cognition, mild cognitive impairment (MCI), or dementia at baseline, who are followed annually. We collected data from the NACC-UDS between June 2005 and June 2021 to formulate 2 separate datasets with patients of different baseline conditions, i.e., (1) baseline-1: patients diagnosed with MCI and aged above 50; and (2) baseline-2: patients with normal cognition and aged above 65. We extracted 2,401 and 5,555 patients for baseline-1 and baseline-2, respectively, with over 268 variables; the detailed variable information can be found in the Appendix A section. We considered three treatment assignments, i.e., statins, anti-hypertensives, and non-steroidal anti-inflammatory drugs (NSAIDs), and aim to estimate their effects on reducing the risk of Alzheimer's disease (AD).

Results

On the synthetic data, models that learn latent representations of the hidden confounders effectively capture the information of the hidden confounders within a temporal structure, resulting in more accurate ITE estimation. Furthermore, the deep representation-based models exhibit a significant improvement over the baseline G-formula, attributed to their capability to handle complex and high-dimensional data through the use of neural networks as the underlying architecture. Table 2 presents the evaluation of the VTD's effectiveness in deconfounding by assessing its performance under different strengths (i.e., adjusting γ) of the hidden confounders Z. The setting is similar to the previous experiment on synthetic data, and we report the RMSE of the outcome prediction as the performance metric.

The results indicate that the proposed VTD outperforms the other baselines, and the performance gain of the VTD increases as the confounding factor increases. It should be noted that both the baselines and the VTD are evaluated on the same data; thus, the performance gain is due to the VTD's more effective modeling of the hidden confounders. The results demonstrate that conditioning on the hidden embedding learned by the VTD yields more robust outcome predictions and reduces the bias in ITE estimation.

We also evaluate the performance of the VTD on the benchmark MIMIC-III dataset. As MIMIC-III is a real-world dataset, we do not know the true hidden confounders. Table 3 demonstrates that the VTD model outperforms both the TSD and the G-formula on all measures and provides better outcome predictions in the "vasopressor-MBP" setting, with performance similar to DSW in the "MV-SpO2" setting. This indicates that the VTD, with its time-aware transformer backbone, can benefit from learning the patterns of irregular elapsed time between consecutive events. Table 4 and Table 5 show the performance of the four models on baseline-1 and baseline-2 of the NACC dataset, respectively. The VTD gains an edge in both settings in terms of outcome prediction power, although we did not observe better IF-PEHE performance for the VTD.

Discussion

In this paper, we introduced a novel approach, the Variational Temporal Deconfounder (VTD), for estimating the individualized treatment effect (ITE) in a longitudinal setting. The method addresses the problem of hidden confounding, which is a critical issue in ITE estimation from observational data such as electronic health records (EHRs). We demonstrated the effectiveness of the VTD's deconfounding ability on synthetic data over different strengths of the confounding factor. The results of the two experiments on synthetic data are consistent and demonstrate that the VTD consistently outperforms existing methods in terms of ITE estimation accuracy and IF-PEHE. In the real-world application using MIMIC-III, the VTD performs better than the baseline G-formula and the TSD model, and similarly to DSW on the IF-PEHE metric (with a few cases where the VTD performed worse than DSW). In the NACC dataset, we observe superior results on outcome predictions and worse, but competitive, results on IF-PEHE compared with DSW.
However, DSW is a deep learning-based approach built on the assumption of unconfoundedness, which our VTD does not make. It is also interesting that the VTD model performs well on the NACC dataset compared with MIMIC-III, as the two real-world datasets capture different disease/application settings: the MIMIC-III data capture care in ICU settings, while the NACC dataset captures a chronic disease setting (i.e., the development of AD). Thus, just as the choice among traditional machine learning algorithms (e.g., support vector machines vs. random forests and others) for a prediction task depends on assumptions about the underlying data distributions, the selection of an appropriate ITE estimation method depends on our assumptions (or lack thereof) about the underlying causal structure (e.g., whether hidden confounders exist and whether proxies exist for them), which explains some of the variation in model performance across different datasets and settings.
The overall improvement of the VTD model lies in its ability to address the problem of hidden confounding in a longitudinal setting, which was not addressed in most previous ITE estimation methods. The use of variational inference with an autoencoder allows the model to create latent variables that recover the distributions of hidden confounders, making it possible to estimate ITEs even in the presence of hidden confounders.
However, the VTD model has some limitations. First, it assumes that proxies for the hidden confounders are available in the observational data; where such proxies are not available, the VTD may not be the most suitable choice. Second, our evaluation of the VTD in real-world applications is limited to the surrogate metric IF-PEHE, as a true gold standard is not available.
In sum, the ability to estimate ITEs in a longitudinal setting while taking into account the existence of hidden confounders makes the VTD model particularly useful for personalized medicine, where the goal is to optimize treatment choices for individual patients based on their unique characteristics observed over time. Nevertheless, further investigations are needed as the unconfoundedness assumption may not hold in certain real-world applications. Identifying the types of real-world applications where unconfoundedness holds or not is thus critical to guide the choice of the modeling approach.

Declarations
Competing Interests: The authors declare no competing interests.
Author contribution: ZF, MP, and JB conceptualized the study idea and carried out the study design. ZF completed the implementation and carried out the experiments. YG provided critical feedback on the experiments and the manuscript. ZF and JB wrote the initial manuscript draft. All authors reviewed and edited the manuscript.

Figure 2. The proposed Variational Temporal Deconfounder (VTD) model architecture.

Supplementary Files
This is a list of supplementary files associated with this preprint: AppendixA.docx