Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model

Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, generating high-fidelity, granular EHR data in its original, high-dimensional form is challenging for existing methods. In this paper, we propose the Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHRs that preserve the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. HALO, designed as a hierarchical autoregressive model, generates a probability density function over medical codes, clinical visits, and patient records, allowing realistic EHR data to be generated in its original, unaggregated form without variable selection or aggregation. Our model also produces high-quality continuous variables in a longitudinal and probabilistic manner. We conduct extensive experiments and demonstrate that HALO generates high-fidelity EHR data with high-dimensional disease code probabilities (d ≈ 10,000), disease code co-occurrence probabilities within a visit (d ≈ 1,000,000), and conditional probabilities across consecutive visits (d ≈ 5,000,000), achieving above 0.9 R² correlation with real EHR data. Compared to the leading baseline, HALO improves predictive modeling by over 17% in predictive accuracy and perplexity on a held-out test set of real EHR data. This performance enables downstream ML models trained on its synthetic data to achieve accuracy comparable to models trained on real data (0.938 area under the ROC curve with HALO data vs. 0.943 with real data). Finally, combining real and synthetic data improves the accuracy of ML models beyond that achieved with real EHR data alone.


INTRODUCTION
The widespread adoption of electronic health record (EHR) systems has established the foundation for machine learning (ML) and artificial intelligence (AI) applications in healthcare. The EHR data is highly complex, comprising over 10,000 unique medical codes for diagnoses, procedures, and medications, as well as thousands of lab measurements. Each patient record can include multiple visits with combinations of diagnoses, procedures, medications, and labs. These combinations create intricate relationships and complex patterns across tens of thousands of medical codes. AI and ML techniques are used to learn and model complex patterns in EHR data, enabling applications such as clinical predictive modeling [1,2], health monitoring [3,4], computational phenotyping [5,6], treatment recommendations [7–9], and more. However, the progress of AI and ML in healthcare is often impeded by the difficulty of accessing and sharing large real EHR datasets. Sharing EHR data is challenging due to privacy, security, and legal constraints. While patient de-identification can alleviate some of these concerns by removing obvious patient identifiers such as name, address, and birth date [10,11], studies have shown that the risk of re-identification remains high even after thorough de-identification [12–14].
Using synthetic patient data can offer a safer alternative to sharing real EHR data. Generative models can produce synthetic datasets as substitutes for real patient data [15–21]. Various methods have been proposed in the literature, including structured patient record generation [19, 20, 22–24] and longitudinal record generation [15,16,21].
To date, existing methods have not been able to generate realistic EHR data in its original, high-dimensional form. The high dimensionality of EHR data, along with rare and sparse variables and complex relationships among variables, makes the generation task extremely difficult. Consequently, existing approaches all concede to creating lower-dimensional data by either aggregating variables or using a subset of more common variables of a manageable size. For example, the MedGAN method [19] modeled 615 disease categories without longitudinal information; the SynTEG model [15] aggregates codes to higher level phenotypes and then removes rare phenotypes, resulting in only 1,276 variables; the ehrMGAN approach [21] reduced the variable dimension to be less than 100, and EVA [16] models frequent co-occurrence patterns in the original EHR data as one-hot vectors, limiting its ability to generate diverse and novel co-occurrence patterns. Table 1 shows that previous approaches have been limited in their ability to model the full dimensionality of real patient data. While these low-dimensional approaches may capture the proper statistics on a small number of variables and support narrow ML use cases relying solely on those variables, the resulting synthetic data is inadequate for broader applications that require high-dimensional data including comprehensive statistical analysis, patient phenotyping, billing prediction and analysis, disease staging, and comprehensive data sharing.
We propose a new approach for generating high-dimensional EHR data in its native form: the Hierarchical Autoregressive Language Model (HALO). This model takes an autoregressive and probabilistic approach, and can capture the hierarchical distribution of EHR records and their temporal relationships. By using a hierarchical approach to model binary sequences of over a million variables, HALO is able to efficiently learn and represent complex patterns in EHR data. We evaluate the performance of HALO by training it on a comprehensive outpatient claims dataset, as well as the MIMIC-III inpatient EHR data [25], and compare the results with a diverse set of existing synthetic EHR data generation techniques such as [15,16,26]. We evaluate the data quality based on its utility in modeling the statistical data distribution and for supporting ML models. HALO can accurately synthesize high-dimensional EHR data via modeling disease code probabilities (d ≈ 10,000), disease code co-occurrence probabilities within a visit (d ≈ 1,000,000), and conditional probabilities across consecutive visits (d ≈ 5,000,000).

Method            Dimensionality
CONAN [18]        128 *
CorGAN [17]       1,071 *
EHR-M-GAN [21]    98
EMR-WGAN [22]     944 *
EVA [16]          − ^
HGAN [23]         926 *
MedGAN [19]       615 *
MedWGAN [20]      1,651 *
SynTEG [15]       1,276
HALO              9,882
Table 1: Previous ML approaches for generating synthetic EHR data and their respective dimensionality. * signifies a non-longitudinal output (producing either a patient embedding or a single aggregated vector instead of a series of visits), while ^ signifies the special case of one-hot vector output that can only generate a limited number of common code combinations per visit, predefined based on patterns from the training EHR data. No past approach has produced synthetic health record data matching the high dimensionality of real EHRs (on the order of 10,000+ medical codes).
In our experiments, we found that HALO achieves an R² correlation above 0.9 when compared to real EHR data, demonstrating its ability to generate realistic data. In addition to generating high-fidelity and granular EHR data, we show that HALO improves predictive modeling on our EHR dataset by more than 17% compared to the leading baseline. We evaluate the predictive accuracy and perplexity of HALO on a held-out test set, demonstrating its superiority. Furthermore, the synthetic data generated by HALO enable downstream ML models to achieve accuracy comparable to models trained on real data, with an AUC of 0.938 for HALO data versus 0.943 for real data. We then demonstrate that combining real data with synthetic data generated by HALO improves the accuracy of ML models beyond what is achieved with real EHR data alone. Finally, we show that HALO generates realistic data while simultaneously protecting the privacy of patients in the training data, as evaluated by a series of privacy metrics.

RELATED WORK
Structured EHRs are multi-level longitudinal records, where each patient is represented by a sequence of visits. Each visit is characterized by a set of medical codes, reflecting the diagnoses, procedures, and medications administered during that visit. Additional patient information, such as demographics, disease phenotype labels, lab test results, and inter-visit time, can also be included.
Of all the EHR generation methods, rule-based approaches, such as Synthea [27] or SynPUF [28], have proven to be the most effective in delivering practical value. These simple approaches offer either de-identification of real records by combining data across multiple patients in a sufficiently privacy-preserving way [28], simulation of patients within a complex yet constrained rule-based system [27], Bayesian probabilistic modeling of aggregated, non-temporal patient records [29], or proprietary methods without detailed explanation [30–32]. Many of these systems can only produce synthetic patient data with limited realism and utility. We focus instead on ML methods that have the potential to generate realistic high-dimensional synthetic patient data.

GAN-based Methods
Many synthetic data generation methods use Generative Adversarial Networks (GANs), which involve a generator that creates realistic data and a discriminator that decides whether the data is real or fake [33]. GANs were first applied to patient record generation in [19], followed by many other GAN-based approaches [15, 17, 18, 20–24, 34]. However, GANs have limitations when generating sequential data like EHRs. They usually produce only one output (with no temporal connections), so most EHR generation methods aggregate EHR data into one time step [22–24], create a representation of EHR data [18], or do both [19,20].
GANs also struggle with high-dimensional and sparse data like real-world EHRs, limiting all existing synthetic EHR GAN approaches to producing relatively low-dimensional data through the aggregation of visits and medical codes or the removal of rare codes. A few methods in this category do generate longitudinal data, but only in low dimensions: LongGAN [34] and EHR-M-GAN [21] both focus only on dense lab time series of under a hundred dimensions. CorGAN [17] generates records with 1,071 distinct codes, and the current state-of-the-art GAN approach that we baseline against, SynTEG [15], both combines and removes rare codes before arriving at a final dimensionality of 1,276.
While GANs have the potential to be conditioned on external factors and labels, such as demographics or disease phenotype labels, the ability to do so has not been extensively explored in existing works on EHR generation. Moreover, there are only a limited number of approaches that can generate synthetic EHR data tailored to specific diseases. For example, SmoothGAN [24] focuses on aggregated lab and medication information and does not model individual visits; EHR-M-GAN [21] offers conditional and sequential capabilities, but for low dimensional (under 100 dimensions) lab time-series information; CONAN and MaskEHR [18,35] model only a single rare-disease population for data augmentation; and EMR-WGAN and HGAN [22,23] can only model low-dimensional (both under 1000 dimensions) aggregated EHRs.

Deep Sequential Methods
Accurately modeling the longitudinal nature of EHRs is crucial for realistic EHR generation. In recent years, two methods have shown progress in generating sequential EHRs by using either a GAN or a VAE to condition on representations of past patient visits to generate current visits [15,16]. Specifically, SynTEG [15] models the time between visits, and EVA [16] offers a conditional variant. In our experiments, we compare HALO to these two models. However, both SynTEG and EVA often need to perform preprocessing steps to reduce the dimensionality of the vocabulary by aggregating medical codes and removing rare codes.

Language Models
Our objective is to develop an improved method for generating realistic and high-dimensional EHR data by drawing inspiration from natural language generation. Language generation models predict the next word based on the preceding words, thereby learning a probability distribution over language. Similarly, EHR models predict the next visit based on past visits. Our proposed method also provides an explicit probability output that allows for direct modeling and evaluation of the underlying data distribution. This approach is particularly beneficial in accurately capturing the complex and high-dimensional nature of EHR data. The Transformer architecture, introduced in [36], has revolutionized natural language processing and enabled the development of large, attention-based models like BERT [37] and GPT [26,38,39]. Among these models, we draw inspiration from GPT, which relies on a stack of Transformer decoder blocks that use masking to predict the next set of probabilities in parallel, allowing for fast training and scalability. However, applying language models directly to EHR data poses unique challenges. Unlike natural language sequences, EHR data exhibits a hierarchical structure that must be captured, with medical codes associated with specific patient visits, and visits associated with individual patients. Additionally, EHR data contains heterogeneous elements, including demographic variables, structured medical codes, and numeric lab measures, not all of which are discrete tokens. Addressing these challenges requires novel approaches that leverage the strengths of language models while adapting them to the peculiarities of EHR data.

METHOD

Problem Formulation
We first formalize the problem and introduce key notations.

EHR Data
We represent a patient record $\mathcal{R}$ as a sequence of visits over time such that

$$\mathcal{R} = \mathcal{V}^{(1)}, \mathcal{V}^{(2)}, \cdots, \mathcal{V}^{(T)} \tag{1}$$

where each visit $\mathcal{V}^{(t)}$ contains a varying number of medical codes $m_1^{(t)}, m_2^{(t)}, \cdots \in \mathcal{C}$, lab values $l_1^{(t)}, l_2^{(t)}, \cdots \in \mathcal{L}$, and the inter-visit time gap $g^{(t)}$. $\mathcal{C}$ is then the set of all medical codes in our vocabulary, including diagnoses, procedures, and medications, and $\mathcal{L}$ is the set of all labs. Beyond the longitudinal records, a patient record also possesses some static information $\mathcal{S}$ containing demographics such as gender, race, and birth year, and a disease phenotype label $\mathcal{D}$ indicating major and persistent disease conditions.

Matrix Representation
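To allow input to HALO and other machine learning models, we then convert $\mathcal{R}$, $\mathcal{S}$, and $\mathcal{D}$ into a matrix representation $\mathbf{R}$. Specifically, we build $\mathbf{R} = [\mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_T, \mathbf{v}_e]$, a matrix containing a sequence of the vector representations for each of the patient's visits, a preceding "start visit" and "label visit", and a succeeding "end visit."

Notation                                                          Description
$m_i^{(t)}$                                                       The $i$-th medical code in $\mathcal{V}^{(t)}$
$g^{(t)}$                                                         The gap between the $(t-1)$-th and $t$-th visits
$\mathcal{S}$                                                     A patient's static demographic information
$\mathcal{D}$                                                     A patient's chronic disease information
$\mathcal{L}$                                                     The set of all labs
$T \in \mathbb{N}$                                                The number of visits in $\mathcal{R}$
$\mathcal{C}$                                                     The set of all medical codes
$\mathbf{R} \in \mathbb{R}^{(T+3) \times |\mathcal{C}|}$          The matrix representation of $\mathcal{R}$, $\mathcal{S}$, and $\mathcal{D}$
$\mathbf{v}_t \in \mathbb{R}^{|\mathcal{C}|}$                     The vector representation of the $t$-th visit in $\mathbf{R}$
$c_t^i \in \{0, 1\}$                                              The binary presence of the $i$-th code in $\mathcal{C}$ in $\mathbf{v}_t$
Table 2: Notations used in the paper.

The start visit $\mathbf{v}_s$ is a one-hot vector containing a special start code added to $\mathcal{C}$ to signify the start of the record, as is often required for certain model architectures.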
The label visit $\mathbf{v}_l$ similarly contains special codes added to $\mathcal{C}$ representing demographic and chronic disease phenotypes from $\mathcal{S}$ and $\mathcal{D}$, respectively. For example, this label visit will have codes representing the patient's gender, racial and ethnic groups, birth year, and any chronic labels.
Each subsequent visit $\mathbf{v}_t \in \mathbb{R}^{|\mathcal{C}|}$ is then represented as a multi-hot binary vector representing the medical codes, lab values, and inter-visit gaps present in that visit. To represent continuous lab values and visit gaps in a discrete form, we employ a granular discretization. This is achieved by adding multiple range codes to $\mathcal{C}$ for each lab test and for the intervals between visits. With all medical information converted into binary variables, $c_t^i$ represents the presence of the $i$-th code in $\mathcal{C}$ in the $t$-th visit of the patient record $\mathcal{R}$.
Finally, to signal the end of the patient record in $\mathbf{v}_e$, a special last visit code is added to $\mathcal{C}$, serving a similar purpose to a stop token in natural language generation. This not only enables generative models to learn when to terminate records but also allows $\mathbf{R}$ to be padded through additional columns into a constant length for batch input without altering its content. Figure 1 depicts the format of the visit vector and the EHR representation. Table 2 lists relevant notations used in the paper.
The generation task is to create $\mathcal{R}'$, a synthetic patient record that is statistically similar to and offers the utility of $\mathcal{R}$ without any one-to-one mapping to a real patient. Our HALO method does this by learning the distribution $P(\mathcal{R})$.
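The matrix representation described above can be illustrated with a minimal sketch. The vocabulary, the special code names `<START>` and `<END>`, the label code names, and the function `build_record_matrix` are hypothetical choices made for this example and are not taken from the HALO implementation.

```python
def build_record_matrix(visits, labels, vocab):
    """Return rows [start, label, visit_1, ..., visit_T, end] as multi-hot lists."""
    index = {code: i for i, code in enumerate(vocab)}

    def multi_hot(codes):
        row = [0] * len(vocab)
        for code in codes:
            row[index[code]] = 1
        return row

    rows = [multi_hot(["<START>"]), multi_hot(labels)]   # start visit, label visit
    rows += [multi_hot(visit) for visit in visits]        # one row per real visit
    rows.append(multi_hot(["<END>"]))                     # end visit
    return rows


# Hypothetical toy vocabulary: medical codes, label codes, then special codes.
vocab = ["E11.9", "I10", "N18.3", "gender:F", "chronic:diabetes", "<START>", "<END>"]
visits = [["I10"], ["E11.9", "I10"], ["N18.3"]]
matrix = build_record_matrix(visits, ["gender:F", "chronic:diabetes"], vocab)
```

With T = 3 visits the matrix has T + 3 = 6 rows, one column per entry of the vocabulary, and each row is binary.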

Hierarchical Autoregressive Language Model (HALO)
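We model the probability of patient record $\mathcal{R}$, $P(\mathcal{R})$, via a hierarchical autoregressive model, which utilizes both the visit- and code-level structures of a patient record. First, it factorizes the probability along the visit level using the autoregressive identity by

$$P(\mathcal{R}) = P(\mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_T, \mathbf{v}_e) = P(\mathbf{v}_s)\, P(\mathbf{v}_l \mid \mathbf{v}_s)\, P(\mathbf{v}_1 \mid \mathbf{v}_s, \mathbf{v}_l) \cdots P(\mathbf{v}_e \mid \mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_T) \tag{2}$$

to produce what we call our coarse autoregressive sequence. We then continue to factorize the probability of visits further along the code level by converting $P(\mathbf{v}_t \mid \mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_{t-1})$ into

$$P(c_t^1 \mid \mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_{t-1})\, P(c_t^2 \mid \mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_{t-1}, c_t^1) \cdots P(c_t^{|\mathcal{C}|} \mid \mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_{t-1}, c_t^1, \cdots, c_t^{|\mathcal{C}|-1}) \tag{3}$$

which we call our fine autoregressive sequence. This final probability is then rewritten as the product

$$P(\mathcal{R}) = \prod_{t} \prod_{i=1}^{|\mathcal{C}|} P(c_t^i \mid \mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_{t-1}, c_t^1, \cdots, c_t^{i-1}) \tag{4}$$

where $t$ runs over the visits in the sequence and the probability of each code is based on each of the previous visits and each of the previous codes in the current visit. Our multi-granularity approach enables the modeling of sequences of many binary variables per record. This is achieved by grouping prior information into significantly fewer multivariate time steps for previous visits, while retaining the full autoregressive modeling capability for each current visit. Our HALO architecture is designed to reflect this powerful yet compact model, with a structure divided into two distinct granularity levels: visit level and code level. This allows each code to be conditioned on all previous visits and the past codes of the current visit.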
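As a small illustration of the two-level factorization, the sketch below accumulates the log-probability of a record one code at a time, with each code conditioned on all earlier visits (coarse context) and on the earlier codes of its own visit (fine context). The callable `cond_prob` is a hypothetical stand-in for a trained model, and `toy_cond_prob` is an arbitrary rule chosen only so that the example runs.

```python
import math


def record_log_prob(rows, cond_prob):
    """Sum log P(code | previous visits, earlier codes of the same visit)."""
    total = 0.0
    for t in range(1, len(rows)):           # row 0 is the start visit
        history = rows[:t]                  # coarse context: all earlier visits
        for i, present in enumerate(rows[t]):
            prefix = rows[t][:i]            # fine context: earlier codes of visit t
            p = cond_prob(history, prefix)  # P(code i of visit t is present | context)
            total += math.log(p if present == 1 else 1.0 - p)
    return total


def toy_cond_prob(history, prefix):
    # Hypothetical stand-in for the network: more codes seen -> higher probability.
    seen = sum(sum(visit) for visit in history) + sum(prefix)
    return min(0.9, 0.1 + 0.05 * seen)


rows = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
uniform = record_log_prob(rows, lambda history, prefix: 0.5)
toy = record_log_prob(rows, toy_cond_prob)
```

With a constant probability of 0.5, the two rows after the start visit contribute 2 × 3 binary decisions, so the log-probability equals 6 · log 0.5.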

Visit-Level Module.
We begin with the coarse, visit-level granularity. We use a stack of transformer decoder blocks to generate a sequence of visit-level histories, where the $t$-th element in the sequence, $\mathbf{h}_t^{(M)} \in \mathbb{R}^{n_{emb}}$, is an embedding that represents all of a patient's medical history through their $t$-th visit. Those histories then combine to form $\mathbf{H}^{(M)} \in \mathbb{R}^{(T+3) \times n_{emb}}$ (where the 3 in $T+3$ includes the start, label, and end visits), the output of the first module, which serves the purpose of the $\mathbf{v}_s, \mathbf{v}_l, \mathbf{v}_1, \cdots, \mathbf{v}_{t-1}$ priors in Equation 4.
To encode each of the multi-hot visit representations $[\mathbf{v}_1 \cdots \mathbf{v}_T]$ into a fixed-length vector in $\mathbb{R}^{n_{emb}}$, we employ an embedding layer that includes two trainable parameter matrices: a code embedding matrix $\mathbf{W}_e$ and a positional embedding matrix $\mathbf{W}_p$. The code embedding matrix maps each visit code to a dense vector representation, while the positional embedding matrix captures the relative position of each visit in the sequence. Next, we use a decoder model consisting of $M = 12$ transformer decoder blocks to generate a series of visit history representations, which summarize the information contained in all previous visits in the coarse, visit-level sequence. The transformer decoder blocks employ masked multi-head self-attention, which allows the model to attend to all previous visits while preventing information leakage from future visits. This process is written more formally as

$$\mathbf{H}^{(0)} = \mathbf{R}\mathbf{W}_e + \mathbf{W}_p, \qquad \mathbf{H}^{(m)} = \text{transformer\_block}(\mathbf{H}^{(m-1)}) \quad \forall m \in [1, M]$$

where $\mathbf{R} \in \mathbb{R}^{(T+3) \times |\mathcal{C}|}$ is the patient record matrix representation, $\mathbf{W}_e \in \mathbb{R}^{|\mathcal{C}| \times n_{emb}}$ is the code embedding matrix, $\mathbf{W}_p \in \mathbb{R}^{(T+2) \times n_{emb}}$ is the positional embedding matrix (to recapture the position and order of the sequence of visits), and each transformer block is based on a decoder block from the original transformer architecture [36], which we describe in more detail in our supplemental material.
Summary: Having processed the multi-hot patient visits through the initial, coarse visit-level module of our architecture, we obtain a sequence of visit history representations $\mathbf{H}^{(M)}$, which capture the collective information of all previous visits up to each time step. These representations provide a compressed summary of the patient's visit history, enabling downstream modules to make predictions based on the patient's medical trajectory.
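The embedding step and the attention mask of the visit-level module can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sketch uses one positional row per input visit, the weights are small hand-picked numbers, and the transformer blocks themselves are omitted. The names `embed_visits` and `causal_mask` are hypothetical.

```python
def embed_visits(R, W_e, W_p):
    """Row t = sum of the embeddings of the codes present in visit t, plus W_p[t]."""
    n_emb = len(W_e[0])
    H0 = []
    for t, visit in enumerate(R):
        row = list(W_p[t])                  # positional embedding for position t
        for i, present in enumerate(visit):
            if present:                     # multi-hot input: add every present code
                for k in range(n_emb):
                    row[k] += W_e[i][k]
        H0.append(row)
    return H0


def causal_mask(n):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(n)] for t in range(n)]


R = [[1, 0, 1], [0, 1, 0]]                      # two visits over three codes
W_e = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # one embedding row per code
W_p = [[0.1, 0.2], [0.3, 0.4]]                  # one positional row per visit
H0 = embed_visits(R, W_e, W_p)
mask = causal_mask(3)
```

Because the input rows are multi-hot rather than one-hot, the product of a visit row with the code embedding matrix reduces to summing the embedding rows of the codes present in that visit.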

Code-Level Module.
However, we still need to add in the code-level priors and generate output probabilities. To construct the input for the fine, code-level module, we offset and concatenate the previous module's visit history embedding outputs with the original record input, $\mathbf{R}$. Specifically, we append the first $T+2$ visit histories with the last $T+2$ visit representations, so that each history is paired with the visit that follows it. Each of the $T+2$ inputs in $\mathbf{H}'^{(0)}$ has a representation of the history of all the previous visits and the codes of the current visit, mirroring both the visit and code priors in Equation 4. The final input representation $\mathbf{H}'^{(0)}$ has size $\mathbb{R}^{(T+2) \times (n_{emb} + |\mathcal{C}|)}$.
To model the distribution of each $P(c_t^i)$, this $\mathbf{H}'^{(0)}$ is then fed through $N = 2$ masked linear layers, which maintain the same dimensionality and use upper triangular masking of the weight matrix to ensure that they preserve the autoregressive property of the probabilities (with a ReLU activation function between layers). The probabilities are generated formally by

$$\mathbf{H}'^{(j)} = \text{masked\_linear}(\mathbf{H}'^{(j-1)}) \quad \forall j \in [1, N], \qquad \mathbf{O} = \mathbf{H}'^{(N)}[:, n_{emb}:]$$

where the submatrix indexing at the end removes the visit-level history embedding portions of each vector to extract just the code probabilities, and the masked linear layers are achieved by

$$\mathbf{H}'^{(j)} = \max\left(0,\ \mathbf{H}'^{(j-1)} \left(\mathbf{W}^{(j)} \odot \mathbf{M}\right) + \mathbf{b}^{(j)}\right)$$

where the max function is omitted for the final fine layer (sigmoid is used instead), $\odot$ is element-wise matrix multiplication, $\mathbf{M}$ is the upper triangular masking matrix (with ones in the upper triangular portion and zeros in the lower portion) to preserve the autoregressive property, and $\mathbf{W}^{(j)} \in \mathbb{R}^{(n_{emb} + |\mathcal{C}|) \times (n_{emb} + |\mathcal{C}|)}$ and $\mathbf{b}^{(j)} \in \mathbb{R}^{n_{emb} + |\mathcal{C}|}$ are the trainable parameters of the module. The output $\mathbf{O} \in \mathbb{R}^{(T+2) \times |\mathcal{C}|}$ is then a matrix of probabilities of each code for each visit after the start visit, built from the visit histories and each previous code in the same visit. Each code corresponds to a conditional probability in the product from Equation 4.
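The autoregressive property of stacked masked linear layers can be checked with a small pure-Python sketch for a single visit. Several details here are assumptions of the sketch rather than statements about the HALO implementation: the mask keeps the diagonal, the probability of code i is read from output position n_emb − 1 + i (so that it has seen the history and codes 0..i−1 but not code i itself), and the weights are random. The names `masked_linear`, `code_probabilities`, and `random_layers` are hypothetical.

```python
import math
import random


def masked_linear(x, W, b):
    """y_j = b_j + sum over i <= j of x_i * W[i][j] (upper-triangular mask)."""
    return [b[j] + sum(x[i] * W[i][j] for i in range(j + 1)) for j in range(len(x))]


def code_probabilities(history, codes, layers):
    """Return P(code i present | history, codes[:i]) for every code i of one visit."""
    n_emb = len(history)
    h = list(history) + [float(c) for c in codes]   # concatenate history and codes
    for k, (W, b) in enumerate(layers):
        h = masked_linear(h, W, b)
        if k < len(layers) - 1:
            h = [max(0.0, z) for z in h]            # ReLU between layers
    # Assumption of this sketch: output n_emb - 1 + i has only seen the
    # history and codes 0..i-1, so it is used as the logit for code i.
    logits = h[n_emb - 1 : n_emb - 1 + len(codes)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid on the last layer


def random_layers(dim, n_layers=2, seed=0):
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)],
         [rng.uniform(-1, 1) for _ in range(dim)])
        for _ in range(n_layers)
    ]


layers = random_layers(3 + 4)                 # n_emb = 3, four codes
history = [0.5, -0.2, 0.1]
p_a = code_probabilities(history, [1, 0, 1, 0], layers)
p_b = code_probabilities(history, [1, 0, 0, 1], layers)   # codes 2 and 3 changed
```

Changing codes 2 and 3 leaves the probabilities of codes 0, 1, and 2 untouched, since none of them is allowed to depend on codes at position 2 or later.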
We train our model using the binary cross-entropy loss function over each medical code (treating the problem as a multi-label classification problem) with masking applied such that the start visit as well as any padded visits (of all zeros) do not contribute to the loss. The architecture of our model is shown in Figure 2.
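A minimal sketch of a masked binary cross-entropy follows. Averaging over the unmasked entries is an assumption made for this example, and `masked_bce` is a hypothetical helper name.

```python
import math


def masked_bce(probs, targets, visit_mask):
    """Mean binary cross-entropy over the codes of all visits whose mask is True."""
    total, count = 0.0, 0
    for p_row, y_row, keep in zip(probs, targets, visit_mask):
        if not keep:                  # start visit or padding: no contribution
            continue
        for p, y in zip(p_row, y_row):
            total -= math.log(p) if y == 1 else math.log(1.0 - p)
            count += 1
    return total / count


probs = [[0.5, 0.5], [0.5, 0.5], [0.999, 0.999]]
targets = [[1, 0], [0, 1], [0, 0]]
visit_mask = [True, True, False]      # last row is an all-zero padded visit
loss = masked_bce(probs, targets, visit_mask)
```

The padded row carries confidently wrong predictions, but because it is masked out the loss is exactly log 2, the value for uninformed predictions of 0.5.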

Additional Features and Considerations
Finally, we discuss different variants and add-on features of HALO.

Conditional Generation.
Our method generates electronic health record (EHR) data by using demographics $\mathcal{S}$ and chronic disease phenotypes $\mathcal{D}$ as labels, which are represented in our label vocabulary and applied to individual visits, as shown in Figure 1. We selected these labels based on their relevance to downstream use cases. Each label is represented as a binary variable in $\mathbf{v}_l$, indicating the presence of the corresponding disease or demographic group indicator. These indicators are defined by concepts such as specific categories of gender, race, ethnicity, age group, and more. We can easily extend this strategy to include other labels of interest, such as various biomarkers, patient outcomes, or even abstract patient embeddings.

Unconditional Generation.
Our setup generates electronic health record (EHR) data with conditional labels by incorporating a "label visit" in the data format, as illustrated in Figure 1. This format enables easy generation of labeled and conditional data, which are highly valuable for using synthetic data in machine learning tasks and as an augmentation tool, particularly for rare cohorts. However, it's important to note that this formatting is optional. If desired, the "label visit" component can be removed from the EHR representation, and the architecture can be trained to generate unconditioned EHRs without any modification.

Generation of Continuous Variables.
Our model can generate not only medical codes but also continuous variables, such as lab values and temporal gaps between visits. However, the availability of these additional variables in the generated data depends on their presence in the original dataset used for training. For example, the outpatient EHR dataset used in our study includes the time between visits, while the inpatient EHR dataset includes lab values.
In previous models, continuous values were typically generated using either GANs, which lack the autoregressive probabilistic modeling that we employ, or value predictors (such as time series analysis models), which we often found to produce average values with insufficient variance. To overcome these limitations, we model continuous variables within the healthcare domain by discretizing lab values and temporal gaps into clinically equivalent buckets. The resulting binary variables are included in the code vocabulary $\mathcal{C}$ and are converted back to continuous values after generation through random uniform sampling within the corresponding bucket range. By using this approach, our model generates more realistic and diverse continuous variables than previous methods.
More specifically, to generate discrete versions of continuous variables, such as lab values and temporal gaps, we divide the range of each variable into several "buckets", represented by the values $b_1, b_2, \cdots, b_{|b(l)|}$, where $|b(l)|$ refers to the number of buckets required. We determine the bucket ranges by either seeking advice from clinicians on practical ranges, creating granular but equivalent groupings, or using a histogram construction algorithm [40]. The same approach is applied to temporal gaps as well. For example, the heart rate lab test, with possible values ranging from 0 to 400 beats per minute, could be broken down into twenty different buckets that split the overall span into smaller ranges, each offering the same medical meaning for all of its contained values. This breakdown could have $b_1 = (0, 40)$ and $b_7 = (90, 100)$. These buckets then convert the single continuous variable into many binary variables. Whenever the continuous variable is present in the original EHR, the single variable representing the corresponding bucket is set to 1, with the rest remaining 0. For instance, if a patient has a heart rate lab measurement of 93 bpm in their seventh visit, the seventh of the new heart rate variables would be 1 and the rest would remain 0. If there was no such lab measurement in the visit, they would all be 0.
These new binary variables are added into the wider code vocabulary $\mathcal{C}$ and treated in the same way as all of the other medical codes in the vocabulary by our HALO model during learning and generation. After generation, the specific lab values and inter-visit gaps are converted back into continuous values by uniformly sampling from the corresponding bucket range at the very end. This discretization allows us to maintain the same powerful and probabilistic modeling process, matching the probabilistic variance of real continuous values in the same way we match the variance of medical code presences. Moreover, by building appropriately granular buckets, we can avoid losing meaningful information and maintain a full representation of a patient. We explore the performance of this approach further in our experiments.
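The discretize-then-sample round trip can be sketched as follows. The bucket edges in `EDGES` are hypothetical: they are chosen only so that there are twenty buckets, the first spans 0 to 40, and the seventh spans 90 to 100, matching the heart rate example; they are not the edges used in the experiments.

```python
import bisect
import random

# Hypothetical heart-rate bucket edges in bpm; bucket k spans EDGES[k]..EDGES[k + 1].
EDGES = [0, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130,
         140, 150, 160, 180, 200, 220, 250, 300, 350, 400]


def to_bucket(value, edges=EDGES):
    """Return the 0-based index of the bucket containing `value`."""
    k = bisect.bisect_right(edges, value) - 1
    return min(max(k, 0), len(edges) - 2)       # clamp values on the outer edges


def to_binary_codes(value, edges=EDGES):
    """One binary variable per bucket; all zeros when the lab is absent."""
    row = [0] * (len(edges) - 1)
    if value is not None:
        row[to_bucket(value, edges)] = 1
    return row


def from_bucket(k, edges=EDGES, rng=random):
    """Convert a generated bucket back to a continuous value by uniform sampling."""
    return rng.uniform(edges[k], edges[k + 1])


codes = to_binary_codes(93)
resampled = from_bucket(codes.index(1), rng=random.Random(0))
```

A measurement of 93 bpm activates the seventh bucket (index 6), and the resampled value lies somewhere between 90 and 100, which is how the approach preserves variance instead of collapsing to an average.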

EXPERIMENTAL RESULTS
We evaluate our method and compare it to several baselines comprising both recently proposed models and other logical autoregressive model architectures on a series of experiments on both outpatient and inpatient EHR datasets. To maintain the fidelity of the original EHR data, our experiments focus on synthesizing original granular medical codes without aggregating or combining codes. Specifically, we seek to answer the following questions.

Datasets and Experimental Setup
Datasets: We use two datasets for our experiments. (1) The outpatient EHR dataset is drawn from large real-world US claims data. It contains 929,268 patients and binary labels for 11 chronic diseases (specific diseases and patient counts are included in the supplementary material). This yields a final real-world outpatient EHR dataset with an average of 34.16 visits per record and 3.52 codes per visit, with 9,882 unique ICD-10 codes. (2) The inpatient EHR dataset is the MIMIC-III ICU stay dataset [25]. It contains 46,520 patients with 25 disease phenotype labels as defined by the MIMIC benchmark [41]. This dataset has an average of 1.26 visits per record and 15.11 codes per visit, with 6,841 unique ICD-9 codes. Note that this includes patients with just a single visit (and, as we will show, HALO's Code-Level Module allows it to be very effective on those patients). Both datasets share the same patient representation as a series of visits along with chronic disease phenotype labels. We keep the ICD codes in the data without aggregating codes or removing any infrequent codes.
Experiment setup: We use a 0.8-0.2 training-test split with an additional 0.9-0.1 training-validation split during training for both outpatient and inpatient datasets. We use the Adam optimizer with a learning rate of 1e-4 (chosen through experimentation). We use a batch size of 48 and train for 50 epochs. Finally, we implement and train the model using the PyTorch framework [42].

Baseline Methods
Below we outline the baseline methods and the necessary alterations to those baselines to adapt to our problem setting.
• HALO − Coarse: This is an ablation baseline consisting of just the coarse, visit-level granularity module of the full HALO architecture. It generates each code probability based on all previous visits (grouped into a multi-hot representation) but without the fine, intra-visit modeling, such that $P(c_t^i)$ is modeled by $P(c_t^i \mid \mathbf{v}_1, \cdots, \mathbf{v}_{t-1})$ instead of $P(c_t^i \mid \mathbf{v}_1, \cdots, \mathbf{v}_{t-1}, c_t^1, \cdots, c_t^{i-1})$. It consists predominantly of 12 transformer decoder blocks in the model of [38], augmented to support multi-hot as opposed to one-hot inputs and outputs within the embedding layer and final activation layer.
• GPT Model [38]: We applied the GPT model without any augmentation to support multi-hot inputs and outputs, instead converting EHRs to a fully one-hot sequential representation. However, this model had to be shrunk from 12 blocks to 3 to fit into memory because this conversion greatly expanded the length of the sequences.
• LSTM EHR Model: A deep, autoregressive LSTM model, which is directly analogous to the HALO − Coarse model but uses LSTM blocks instead of transformer decoder blocks.
• SynTEG [15]: A GAN-based model that uses a transformer and LSTM-based encoder model to generate embeddings of EHRs up to a given visit before feeding those embeddings into a conditional GAN which generates the next visit.
• EVA [16]: A VAE-based model which uses a bidirectional-LSTM encoder and CNN-based decoder (using deconvolutions to expand the latent encoding to the proper temporal dimension and then masked, dilated 1D convolutions to build the records in an autoregressive manner). The only change we made was to convert the output from one-hot code combinations to multi-hot code probabilities to allow for greater representative power.
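To see why a fully one-hot representation lengthens the sequences for the GPT baseline, consider the sketch below, which flattens a record into one token per code. The separator token `<VISIT_END>` and the function `flatten_record` are assumptions of this example; the source does not specify how visit boundaries were encoded.

```python
def flatten_record(visits, sep="<VISIT_END>"):
    """Turn a list of visits (each a list of codes) into a single token sequence."""
    tokens = []
    for visit in visits:
        tokens.extend(visit)     # one one-hot token per code
        tokens.append(sep)       # hypothetical marker closing the visit
    return tokens


visits = [["I10"], ["E11.9", "I10"], ["N18.3", "I10", "E78.5"]]
tokens = flatten_record(visits)
multi_hot_steps = len(visits)    # the multi-hot models use one step per visit
```

For the outpatient dataset, with an average of 34.16 visits per record and 3.52 codes per visit, this corresponds to roughly 120 code tokens per record (before any separators) instead of about 34 multi-hot steps.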

Evaluating EHR Language Modeling
The first evaluation is conducted by predicting the probabilities and outputs of the test set. In this phase, we assess the performance of HALO against two multi-hot language model baselines, namely HALO − Coarse and LSTM. These baselines explicitly generate a probability distribution without accessing the entire input. It's worth noting that other baseline models, such as the GAN-based SynTEG model, the VAE-based EVA model, and the GPT model, cannot be directly compared in this task. This is because these methods sequentially add elements within visits and/or do not make a single probability prediction for each code within the visit. Our first evaluation aims to assess the capability of the models to predict the presence of potential medical codes, given a patient's past medical history and the previous codes from the current visit. Note that we explore different orderings of codes (such as most to least prevalent, alphanumeric, random, etc.) but find no noticeable effect, settling on a random ordering throughout our experiments. This evaluation is crucial in showcasing a model's ability to learn patterns from the patient population, as well as its potential to perform well in various patient simulation and extension applications. We show the results in Table 3 where we see that HALO outperforms  the two compared language model architectures. Upon closer examination, we observed that the LSTM baseline model struggled with the complexity and size of the outpatient EHR dataset, while our proposed model HALO performed comparably to the HALO − Coarse ablation baseline. In contrast, in the inpatient EHR setting, where the visits are shorter but contain more codes, HALO's multigranularity approach proved to be highly effective. Specifically, the model achieved a notable 4% reduction in test BCE loss and a 17% increase in F1 Score when compared to the single granularity HALO − Coarse model. 
Notably, both HALO models significantly outperformed the LSTM baseline in this setting. These results highlight the value of our multi-granularity approach in handling the complex and diverse nature of medical codes in different EHR settings. Additionally, we present perplexity, which evaluates the likelihood of the test set as quantified by a model trained on the training set, normalized by the unit of consideration that we are interested in. In our case, this normalizing unit is the number of medical codes in a patient's medical record (or equivalently the number of ones in R). Perplexity is defined mathematically by

$$PP(D) = \sqrt[N]{\frac{1}{P(R^{(1)}, R^{(2)}, \cdots, R^{(|D|)})}}$$

where $D$ is the test dataset, $R^{(n)}$ is the $n$-th record in $D$, and $N$ is the total number of normalizing units (medical codes) in $D$. In practice we calculate the values by summing their log probabilities, using the equivalent form

$$PP(D) = \exp\left(-\frac{1}{N} \sum_{n=1}^{|D|} \log P(R^{(n)})\right)$$

The normalized value then also corresponds to how many of the different normalizing units (medical codes) one would have to randomly pick between on average to achieve the same probability. The results of the perplexity evaluation are shown in Table 4. We see similar results as with the classification evaluation: both HALO and HALO − Coarse perform very well on the outpatient EHR dataset (with HALO performing slightly better) while the LSTM baseline struggles, and HALO easily outpaces both baseline methods in this likelihood evaluation for the inpatient EHR dataset, producing a 13% lower perplexity per present code as compared to the HALO − Coarse architecture without the intra-visit modeling. Thus, in both of these test set evaluations, we see that HALO is much more effective in terms of modeling the underlying distribution of EHRs.
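The log-probability form of perplexity described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is ours, the per-record log probabilities are taken as given (natural log), and the normalizer is assumed to be the total number of present codes across the test set.

```python
import math


def perplexity_per_code(record_log_probs, record_code_counts):
    """Perplexity normalized by the number of present medical codes.

    record_log_probs: natural-log probability assigned by the model to each test record.
    record_code_counts: number of present codes (ones in the multi-hot record) per record.
    """
    if len(record_log_probs) != len(record_code_counts):
        raise ValueError("need one log probability and one code count per record")
    total_codes = sum(record_code_counts)
    if total_codes <= 0:
        raise ValueError("need at least one present code")
    # Summing log probabilities avoids the underflow caused by multiplying tiny numbers.
    total_log_prob = sum(record_log_probs)
    return math.exp(-total_log_prob / total_codes)


# Toy check: if every code were a uniform pick among 8 options, perplexity would be 8.
log_probs = [3 * math.log(1 / 8), 5 * math.log(1 / 8)]
code_counts = [3, 5]
toy_perplexity = perplexity_per_code(log_probs, code_counts)
```

The toy check mirrors the interpretation given in the text: the normalized value equals the number of equally likely options one would have to pick between per code to reach the same probability.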

Statistical Similarity to Real EHRs
The second analysis evaluates the statistical similarity of the generated and real data. For each method, we generate a synthetic dataset of the same size as the training dataset. We then compare the unigram and bigram (both within the same visit and across consecutive visits) probabilities for each unique code and pair of codes within the true and synthetic datasets. We perform this evaluation normalized at both the visit and record level, analyzing roughly 10,000 individual codes and over a million pairs of codes. Finally, we also compare the means, standard deviations, and probabilities of certain aggregate statistics such as the number of visits per record, number of medical codes per visit, and the prevalence of each chronic disease label. We show plots of the code probabilities normalized at the record level and a figure containing the chronic disease label probabilities for the outpatient EHR dataset in Figure 3 and Figure 4 respectively. We offer an interactive visualization (which allows zooming, panning, and hovering over points for specific disease names) of the "HALO vs. Real" disease prevalence plot at https://vega.github.io/. We also provide a table containing the aggregate statistics for both datasets in Table 5. Furthermore, we offer the $R^2$ values for each of the three types of code probabilities normalized at the visit level in both our core high-dimensional outpatient EHR dataset and a lower-dimensional setting (with code aggregation and rare code removal down to around 1,300 different prevalent code phenotypes) in Table 6. Finally, we provide the full visit-level code probability plots, probability densities underlying the aggregated statistics, and a discussion of the various failure modes of our baseline methods for that evaluation in our supplementary material. HALO again outperforms the baseline methods in each evaluation.
Specifically, the GPT baseline struggles with the complexity of the outpatient EHR dataset in terms of stopping the record generation. This is common to many language models in the text generation domain, whose overall quality decays for long sequences, and the lack of visit-level grouping in GPT's data representation makes its sequences considerably longer. Beyond that, the language model architectures (GPT, LSTM, HALO − Coarse, and HALO) are able to model both the shape of the synthetic records and the temporal dependencies much better on average than the VAE-based and especially the GAN-based baselines. While each of the compared methods models the unigram code probabilities relatively well, this better temporal modeling is shown in the overall synthetic record and visit lengths, the generation of chronic disease labels in the second visit, and the sequential bigram evaluation. However, the LSTM and HALO − Coarse language model baselines falter with respect to same-visit bigram probabilities due to their lack of intra-visit dependency modeling, while the GPT baseline, which models each code individually and so offers that intra-visit modeling, is able to maintain relatively stronger performance there. HALO is able to combine and build on each baseline's strengths without any of the weaknesses, using the compact multi-hot representation to offer an extremely powerful model that does not struggle with any length or feature of data while simultaneously maintaining the intra-visit modeling in an even more powerful and structured way. As such, it is able to best maintain performance in this high-dimensional setting and produces state-of-the-art results which closely model the true training data in all settings, from record and visit lengths to label probabilities and finally all combinations of code probabilities. This signifies that HALO is capable of generating data which looks incredibly realistic, at least at the surface level.
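The visit-level unigram, same-visit bigram, and sequential-visit bigram probabilities compared in this section can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and toy data are ours, and we assume the reported correlation is the coefficient of determination with the real probabilities as the reference, which the text does not spell out.

```python
from collections import Counter
from itertools import combinations


def code_probabilities(records):
    """records: list of records; each record is a list of visits; each visit is a set of codes.

    Returns visit-normalized (unigram, same-visit bigram, sequential-visit bigram) probabilities.
    """
    unigram, same_visit, sequential = Counter(), Counter(), Counter()
    n_visits = 0
    n_transitions = 0
    for record in records:
        for t, visit in enumerate(record):
            n_visits += 1
            unigram.update(visit)
            same_visit.update(combinations(sorted(visit), 2))
            if t + 1 < len(record):
                n_transitions += 1
                sequential.update((a, b) for a in visit for b in record[t + 1])
    return (
        {k: c / n_visits for k, c in unigram.items()},
        {k: c / n_visits for k, c in same_visit.items()},
        {k: c / max(n_transitions, 1) for k, c in sequential.items()},
    )


def r_squared(real, synthetic):
    """Coefficient of determination of synthetic probabilities against real ones."""
    keys = sorted(set(real) | set(synthetic))
    y = [real.get(k, 0.0) for k in keys]
    y_hat = [synthetic.get(k, 0.0) for k in keys]
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    if ss_tot == 0:
        raise ValueError("real probabilities are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Toy datasets with placeholder code names (not real medical codes).
real = [[{"A", "B"}, {"B", "C"}], [{"A"}]]
synthetic = [[{"A", "B"}, {"B"}], [{"A", "C"}]]

real_unigram, real_same, real_seq = code_probabilities(real)
synth_unigram, synth_same, synth_seq = code_probabilities(synthetic)

unigram_r2 = r_squared(real_unigram, synth_unigram)
bigram_r2 = r_squared(real_same, synth_same)
```

In this toy case the synthetic data matches every unigram probability yet pairs the codes incorrectly, so the unigram score is perfect while the same-visit bigram score is poor. This is the kind of gap the bigram evaluations are designed to expose.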

Accurate Disease Phenotyping Using Synthetic EHRs
The final evaluation explores the utility of the synthetic datasets for training disease classifiers. To this end, we utilize two different synthetically-supplemented data setups. In each of the two data setups we use a simple bidirectional LSTM with a single-layer fully connected head classifier to predict chronic disease label(s) based on a patient's visits.
Accurate Disease Phenotyping: The first of the two data setups explores how models perform in real-world settings when the training data is either completely synthetic or augmented with synthetic data. We repeat the experiments for each of the 11 chronic disease labels in the outpatient EHR dataset, which originate from the list identified by the Centers for Medicare and Medicaid Services and used in the SynPUF dataset [28], and also for each of the 25 chronic diseases in the inpatient EHR dataset, which originate from the popular benchmark proposed in [41]. For each chronic disease, we randomly extract 2,500 records for training that both do and do not possess that chronic disease phenotype label from each of our 6 synthetic datasets and the real training data, forming 7 balanced training datasets. The number 2,500 was chosen to be large enough for training machine learning models but small enough that each dataset had enough positive labels for each disease. We then train classifiers on each of these datasets for each label. We select the best model for each dataset using a validation set of 250 records of either class from the original validation dataset, and we evaluate on test sets of 500 records of either class from the original test set. We display the average accuracy, F1 score, and rank for each synthetic dataset from each of the compared models across each chronic disease label in the inpatient EHR dataset in Table 8. For the outpatient EHR dataset we then additionally explore models trained on a training set of real data augmented with each of those synthetic datasets, and we show the aggregated results of mean test set classification performance across the 11 label-based tasks in Table 7. We provide a full set of results by chronic disease label in our supplementary material. In both datasets, we can see that the data from each of GPT, HALO − Coarse, and HALO largely maintains the performance of real training data and offers large improvements over the SynTEG, EVA, and LSTM baselines. HALO then offers the best results, having the least drop-off among the three on average when used to train in the absence of real data and also the most improvement in performance over just the real training data when used as an augmentation technique.

Figure 3 (caption, continued): With the exception of SynTEG, all models exhibit some correlation in the unigram and temporal bigram evaluations, but many have weak correlation or consistently yield higher synthetic probabilities due to a lack of temporal consistency and repetition across visits in the records. HALO and, to a lesser extent, HALO − Coarse perform the best in all settings, while HALO is the only one that can realistically produce pairs of codes within and across visits and achieve state-of-the-art results.

Table 6: We calculated $R^2$ values to measure the correlations of the three types of code probabilities for different synthetic datasets against the training data in both high-dimensional and low-dimensional settings. Although the results showed a drop in performance for each method in the high-dimensional setting, HALO was able to maintain strong performance with minimal decline. Overall, our proposed method achieved state-of-the-art performance, outperforming the baselines in both bigram evaluations in low- and high-dimensional settings.
Phenotyping of Rare Conditions: We conducted a simulation to demonstrate the usefulness of synthetic EHR data in identifying uncommon conditions. We extracted a highly imbalanced dataset of patients labeled with the cancer chronic disease from the outpatient EHR dataset. The dataset consisted of 50,000 EHR records from the original outpatient EHR training data without the cancer chronic disease label and just 1,000 with the label. We trained a classifier on this imbalanced data and compared its performance to classifiers trained on data balanced by adding 49,000 positively labeled records from each of our synthetic datasets.
We then also baselined with a classifier trained on an upper bound ideal dataset balanced using real data.
The results of the evaluation are shown in Table 9. In particular, HALO outperforms each of the baselines, offering large gains over the original unbalanced dataset as well as the other synthetically augmented datasets and approaching the upper bound performance of the ideal balanced dataset.
This simulation shows the potential of synthetic EHR data to support the identification of uncommon conditions and highlights the value of using balanced data for training classifiers.
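The balancing step in this simulation can be sketched as follows. This is a minimal illustration with hypothetical names and toy sizes (50 negatives and 1 positive instead of 50,000 and 1,000); it only shows how synthetic positives are added until the classes are the same size, not how the classifier is trained.

```python
import random


def balance_with_synthetic(real_records, real_labels, synthetic_positive_records, seed=0):
    """Add synthetic positive records until both classes have the same size."""
    n_pos = sum(1 for y in real_labels if y == 1)
    n_neg = len(real_labels) - n_pos
    n_needed = n_neg - n_pos
    if n_needed < 0:
        raise ValueError("positives already outnumber negatives")
    if n_needed > len(synthetic_positive_records):
        raise ValueError("not enough synthetic positive records to balance the data")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    extra = rng.sample(synthetic_positive_records, n_needed)
    return list(real_records) + extra, list(real_labels) + [1] * n_needed


# Toy stand-ins for patient records; real records would be sequences of visits.
real_records = [("neg", i) for i in range(50)] + [("pos", 0)]
real_labels = [0] * 50 + [1]
synthetic_pool = [("syn", i) for i in range(60)]

balanced_records, balanced_labels = balance_with_synthetic(
    real_records, real_labels, synthetic_pool
)
```

The same helper could be pointed at real positive records to build the upper-bound dataset used as a reference in the text.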

Realistic Continuous Variables in Synthetic EHRs
We conclude with a brief exploration to demonstrate the viability of our discretized representation of continuous values and HALO's effectiveness in using it to model those variables. We build new training datasets including visit gaps in the outpatient EHR dataset and lab values in the inpatient EHR dataset. We use these datasets to train a new version of our model and generate synthetic datasets of 250,000 and 45,000 records respectively. We then show that the distributions of those variables match the real values. In Figure 5 and Table 10, we show that HALO accurately replicates the gaps between patient visits and the pattern of shorter gaps for longer records. In Figure 6, we demonstrate that HALO replicates not only the presence but also the average values of performed lab tests. Specific labs included (corresponding to points in those two plots) are included in our supplemental material. Overall, HALO's approach to continuous variables is effective, and it has the potential to generate comprehensive synthetic patient records with multiple variables of different types.

Figure 4 (caption, continued): The SynTEG and LSTM baselines both struggle with temporal consistency as manifested through their weak ability to create these chronic disease labels in the “label” visit, so they are omitted from the plot. In contrast, the EVA, HALO − Coarse, and HALO architectures all closely mirror the training data, with HALO and EVA performing the best overall on average.

Table 7: We compare the performance of chronic disease classification models trained on different types of training data in the outpatient setting: real data, synthetic data generated by different methods, and real data augmented by synthetic data. GPT, HALO − Coarse, and HALO's synthetic data perform better than the other methods, and are comparable to using real data as training data. Augmenting real data with HALO's synthetic data leads to better performance than just using real data. HALO has the best results, with little drop-off in performance compared to real data and the largest gain when used to augment the training set.

Table 9: The results of a variety of binary classification metrics on the test set for the simulated rare-disease detection task, comparing models trained on datasets balanced using each of the synthetic datasets and baselined against models trained on the original imbalanced data (representing the rare disease dataset) and an upper bound ideal dataset balanced using real data. EVA and SynTEG fail to offer much utility while the language model architectures LSTM, GPT, and HALO − Coarse offer a lot of value. However, HALO achieves state-of-the-art results and closely approximates the performance of a true, balanced dataset.

Table 10: The average gap between visits in number of days in the outpatient EHR training dataset and the synthetic HALO dataset created using the augmented method to handle additional continuous variables. The full probability distributions underlying these numbers can be seen in Figure ??.

Figure 5: Two demonstrations of HALO being able to capture the distribution of the gaps between visits in the outpatient EHR dataset once the model is augmented to support it. First, examining the mean visit gap by visit number across both the real and synthetic datasets shows that HALO is able to effectively capture the pattern of patients with many records having shorter gaps in their later visits. Second, the probability density of the visit gaps as a whole shows HALO approximating the true shape overall as well.
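As a rough illustration of what a discretized representation of a continuous variable can look like, the sketch below maps a value to a bin index and maps a generated bin index back to a continuous value. Everything in it is an assumption made for illustration: the function names, the bin edges for visit gaps in days, and the choice to sample uniformly within a bin are not taken from this section.

```python
import bisect
import random


def discretize(value, bin_edges):
    """Return the index k of the bin [bin_edges[k], bin_edges[k + 1]) containing value.

    Values outside the covered range are clipped to the first or last bin.
    """
    k = bisect.bisect_right(bin_edges, value) - 1
    return min(max(k, 0), len(bin_edges) - 2)


def sample_from_bin(bin_index, bin_edges, rng):
    """Turn a generated bin index back into a continuous value (uniform within the bin)."""
    low, high = bin_edges[bin_index], bin_edges[bin_index + 1]
    return rng.uniform(low, high)


# Hypothetical bin edges for the gap between visits, in days.
gap_edges = [0, 7, 30, 90, 365]

short_gap_bin = discretize(10, gap_edges)    # falls in the [7, 30) bin
clipped_bin = discretize(400, gap_edges)     # beyond the last edge, clipped to the last bin
rng = random.Random(0)
sampled_gap = sample_from_bin(short_gap_bin, gap_edges, rng)
```

With a representation of this kind, each bin can be treated like an additional code in the vocabulary, so that continuous variables are generated by the same machinery as medical codes.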

Privacy Protection of Synthetic EHRs
In addition to demonstrating the high fidelity of synthetic EHRs generated by HALO, we want to ensure that the privacy of the patients within the original training dataset has not been compromised. To that end, we conducted three commonly used privacy evaluations to test its robustness. Our results show that the outstanding performance of HALO is not due to memorization or any other violation of patient privacy.

Membership Inference Attack.
The first evaluation is the ability to thwart a membership inference attack. These attacks aim to determine whether a real patient record was used in the training dataset to generate the synthetic records. Membership inference attacks are a well-known privacy test in the field of synthetic EHR generation, and addressing them is crucial to ensure the privacy and confidentiality of patient identities.
To demonstrate that HALO is not susceptible to such an attack, we show that we can prevent two different attempts at a membership inference attack based on the synthetic data generator and the synthetic dataset itself. We generate an attack dataset by first selecting 100,000 records from each real dataset that were used for training and assigning them a positive label. Then we select 100,000 records from the remaining records not used for training as the negative label set. Next, we conduct two attacks:
• In the Model Attack, we label the 100,000 records with the highest log probability from the model as positive, predicting that they were part of the training dataset.
• In the Dataset Attack, we label the 100,000 records with the lowest Hamming distance to the closest record in the synthetic dataset as positive. We pick Hamming distance (equivalent to Manhattan distance in our binary setting) as our distance metric between patient records throughout our privacy evaluations in accordance with [43], but any distance metric could be substituted interchangeably.
These two attacks allow us to test the ability of the synthetic dataset to prevent an attacker from inferring whether a real record was used in the training dataset.
We show the results of the classifications from the attacks in Table 11. The accuracy of both attacks on both datasets is approximately 50%, which is similar to a random guess. This shows that neither the model nor the synthetic dataset reveals any meaningful or compromising information about the identities of patients in the training dataset. We also perform the dataset attack with each of our baseline datasets and see that each similarly thwarts it, achieving very similar accuracies of around 50% as well. Note that we do not perform the model attack with the baseline models because most of them do not offer a probability output for input patient records, and the dataset-based attack is the standard one used throughout the literature in this domain.
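The dataset-based attack described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are ours, records are assumed to be flattened binary vectors of equal length, and the toy data deliberately depicts a generator that copies its training records so that the attack succeeds.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary records differ."""
    return sum(x != y for x, y in zip(a, b))


def dataset_attack_accuracy(attack_records, is_member, synthetic_records):
    """Label the half of the attack records closest to the synthetic data as members."""
    distances = [
        min(hamming(record, synth) for synth in synthetic_records)
        for record in attack_records
    ]
    order = sorted(range(len(attack_records)), key=lambda i: distances[i])
    predicted = [False] * len(attack_records)
    for i in order[: len(attack_records) // 2]:
        predicted[i] = True
    correct = sum(p == m for p, m in zip(predicted, is_member))
    return correct / len(attack_records)


# Toy attack set: the first two records were "used for training", the last two were not.
attack_records = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
is_member = [True, True, False, False]

# A leaky generator that simply copies its training records.
copied_synthetic = [[1, 1, 0, 0], [0, 0, 1, 1]]
leaky_accuracy = dataset_attack_accuracy(attack_records, is_member, copied_synthetic)
```

A generator that memorizes its training data drives this accuracy toward 1, whereas an accuracy near 0.5, as reported in Table 11, means the attacker does no better than guessing.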

Attribute Inference Attack.
The second evaluation is the ability to thwart a typical attribute inference attack. This attack determines whether the synthetic dataset leaks specific and sensitive patient attributes based on correlations from demographic and other more common, less sensitive attributes of the patient. Consequently, it tests whether the synthetic dataset can be used to learn individual attributes of real patient data.
To demonstrate that HALO is not susceptible to such an attack, we show that it thwarts the nearest neighbor-based attribute inference attack. In this attack, we use subsets of the synthetic dataset and the original training dataset, randomly sampled to match the size of the original test dataset. We define demographic information, chronic disease labels, and the binary presence of the 500 most common medical codes (determined by the training dataset) as the conditional attributes. The sensitive attributes to be identified are the binary presence of the remaining uncommon medical codes.
To conduct the attack, we find the closest patient in the synthetic dataset for each patient in the training set based on having the most shared conditional attributes. We then predict each of the uncommon attributes to be the same as that closest synthetic patient. Those predicted attributes are compared with the ground truth sensitive patient attributes and graded using F1 Score. We then repeat this attack with real patients from the test dataset in place of the synthetic dataset and use the results as a baseline for acceptable attribute inference.
We show the results of the classifications from the nearest neighbor attacks in Table 12. There we see that not only are the prediction F1 Scores incredibly low on both datasets (4.7% for the outpatient dataset and 3.3% for the inpatient dataset), they are crucially lower than the baseline attack from the test set. This attack, labeled “Real Data Attack” in the table, sets the threshold for the amount of information revealed by the patterns of real data, so staying below that level means incurring only an acceptable amount of attack success. We therefore see that the synthetic dataset does not reveal any meaningful insight into the attributes of real patient data. Each of the baseline synthetic datasets passes the test as well by having a lower F1 Score than the real data attack. GPT and HALO − Coarse allow similar F1 Scores to HALO while all of the rest have extremely low scores, likely because they do not capture the real patterns as effectively.
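The nearest neighbor attribute inference attack can be sketched as follows. This is a minimal illustration under our own assumptions: the function name is hypothetical, "most shared conditional attributes" is interpreted as the number of conditional positions on which two records agree, and the F1 Score is micro-averaged over all sensitive attributes, which the text does not specify.

```python
def attribute_inference_f1(target_records, reference_records, conditional_idx, sensitive_idx):
    """Predict each target's sensitive attributes from its closest reference record."""
    tp = fp = fn = 0
    for target in target_records:
        # Closest reference record = the one agreeing on the most conditional attributes.
        closest = max(
            reference_records,
            key=lambda ref: sum(ref[j] == target[j] for j in conditional_idx),
        )
        for j in sensitive_idx:
            predicted, truth = closest[j], target[j]
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth and not predicted:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy binary records: positions 0-1 are conditional attributes, 2-3 are sensitive ones.
training = [[1, 0, 1, 0], [0, 1, 0, 1]]
conditional_idx, sensitive_idx = [0, 1], [2, 3]

copied_synthetic = [[1, 0, 1, 0], [0, 1, 0, 1]]      # leaks the sensitive attributes
unrelated_synthetic = [[1, 0, 0, 1], [0, 1, 1, 0]]   # same conditionals, different sensitive codes

leaky_f1 = attribute_inference_f1(training, copied_synthetic, conditional_idx, sensitive_idx)
safe_f1 = attribute_inference_f1(training, unrelated_synthetic, conditional_idx, sensitive_idx)
```

Running the same function with real test records as the reference set would give the baseline attack that the text uses as the threshold for acceptable inference.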

Nearest Neighbor Adversarial Accuracy Risk.
The final evaluation, first proposed in [44], measures the degree to which a model overfits to its training dataset by looking at the relative likelihood of a patient's nearest neighbor being in the same or different datasets. As such, passing this test ensures that a generative model is generating wholly new synthetic patients rather than copying or performing simple augmentation on real training patients.
The evaluation is performed by calculating the metric Nearest Neighbor Adversarial Accuracy (NNAA). Let $S_T$, $S_S$, and $S_E$ be random subsets of $n$ records (we use $n$ = 5,000 records in our experiment) from the training, synthetic, and evaluation datasets respectively. NNAA risk is then the difference

$$NNAA = AA_{ES} - AA_{TS}$$

$$AA_{ES} = \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big(d_{ES}(i) > d_{EE}(i)\big) + \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big(d_{SE}(i) > d_{SS}(i)\big)\right)$$

$$AA_{TS} = \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big(d_{TS}(i) > d_{TT}(i)\big) + \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big(d_{ST}(i) > d_{SS}(i)\big)\right)$$

where the subscript $E$ throughout refers to the evaluation (test) dataset, $S$ refers to the synthetic dataset, and $T$ refers to the training dataset. $\mathbb{1}(\cdot)$ is then the indicator function and $d_{ES}(i)$ is the distance from the $i$-th record in the evaluation dataset to its closest record (as determined by Hamming distance in accordance with [43]) in the synthetic dataset. Each of $E$ and $S$ in $d_{ES}(i)$ can also be replaced interchangeably with any of $E$, $S$, and $T$, where the calculation just omits the record in question if the two datasets are the same. So, each $\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(d_{AB}(i) > d_{AA}(i))$ component is the probability of a record in dataset A being closer to another record in its own dataset than any record in dataset B. If they are randomly drawn from the same or similar distributions, we would expect that probability to be $\frac{1}{2}$, but it could be much lower if one of the datasets were copying from the other. We baseline this likelihood of the synthetic dataset copying from both its training and testing datasets, comparing the two to produce our overall risk.
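The NNAA risk described above can be sketched in code as follows. This is a minimal illustration following our reading of the metric from [44]: the function names are ours, records are assumed to be equal-length binary vectors, and the toy data is deliberately extreme (a synthetic set that copies the training set) so that the arithmetic can be checked by hand.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary records differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_distance(record, dataset, skip_index=None):
    """Distance to the closest record in dataset, optionally omitting one index."""
    return min(
        hamming(record, other) for k, other in enumerate(dataset) if k != skip_index
    )


def adversarial_accuracy(a, b):
    """Average probability that a record is closer to its own dataset than to the other."""
    n = len(a)
    a_term = sum(
        nearest_distance(a[i], b) > nearest_distance(a[i], a, skip_index=i) for i in range(n)
    ) / n
    b_term = sum(
        nearest_distance(b[i], a) > nearest_distance(b[i], b, skip_index=i) for i in range(n)
    ) / n
    return 0.5 * (a_term + b_term)


def nnaa_risk(train, synthetic, evaluation):
    """Adversarial accuracy against unseen data minus adversarial accuracy against training data."""
    return adversarial_accuracy(evaluation, synthetic) - adversarial_accuracy(train, synthetic)


# Toy data: the synthetic set is an exact copy of the training set.
train = [[0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]]
copied_synthetic = [list(record) for record in train]
evaluation = [[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 0, 1], [1, 1, 1, 0, 1, 1]]

copy_risk = nnaa_risk(train, copied_synthetic, evaluation)
```

In this contrived case every training record has an identical synthetic twin, so the training-side accuracy collapses to 0 and the risk is far above the 0.03 threshold. A generator producing genuinely new records would keep the two accuracies close and the risk near 0.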
[44] set 0.03 as the threshold for an acceptable NNAA risk. We show in Table 13 that the NNAA values for both our inpatient and outpatient datasets are easily below that mark. Furthermore, we show that as more data is added, as with the outpatient EHR dataset, the risk decreases to an extremely small value. So, we show that our HALO method is not overfitting to or copying from its training dataset and instead is producing wholly new synthetic records. We repeat the evaluation with each of the baseline synthetic datasets and show that they pass as well. HALO thus passes all three privacy evaluations and shows that its impressive performance does not come at the expense of patient privacy.

Table 11: The results of the two different membership inference attacks using the HALO model. For each record in the attack dataset, we find both the log probability of the record from the trained model (Model attack) and the Hamming distance to the closest record in the synthetic dataset (Dataset attack). The attacks then label the half of the records with the highest probability or lowest distance, respectively, as being in the training set. We see that the accuracy for either attack is right around 50%, which is similar to a random guess. This indicates that the synthetic dataset and the model do not reveal any patient-identifying information about the original training datasets. We also find that each of the baseline synthetic datasets similarly thwarts the dataset attack.

Table 12: The results of a nearest neighbor attribute inference attack. The results showed that the F1 Score on both the inpatient and outpatient datasets was below 0.05, and crucially lower than the baseline attacks using real data from the test set. This baseline attack sets the threshold for the amount of information revealed by the patterns of real data, and so staying below it means incurring only an acceptable amount of attack success. This suggests that the synthetic dataset does not reveal any significant insights into the attributes of real patient data, and that HALO is effective in preventing an attacker from inferring sensitive information. Each of the baseline synthetic datasets passes the test as well by having a lower F1 Score than the real data attack. GPT and HALO − Coarse allow similar F1 Scores to HALO while all of the rest have extremely low scores, likely because they do not capture the real patterns as effectively.

LIMITATIONS
While we have shown the impressive performance of HALO in producing high-quality, high-fidelity, and privacy-preserving synthetic EHR data, we now briefly discuss some remaining limitations. First, the architecture is designed in the model of a large language model. While the multi-modal setup allows the model to condition on more patterns per data point and learn more efficiently, our high-performing generator still requires relatively large training datasets which might not be available in some settings. Another important aspect of our model is that it generates synthetic records through a probabilistic process. While it learns real-world patterns during training, there is still a chance that some generated records may not be clinically meaningful. However, this risk can be mitigated through postprocessing with clinical rules that validate the synthetic records. If our model is deployed in the real world, it is important to consider implementing such postprocessing steps to ensure that only clinically relevant synthetic records are produced.

Table 13: The Nearest Neighbor Adversarial Accuracy (NNAA) risk values for our two datasets. These values are calculated through the likelihood of data in the synthetic dataset being overly similar to records in the training set, normalized by their baseline likelihood of being close to unseen test set data. The metric was proposed in [44], where they set 0.03 as the acceptable risk threshold, a value that both the inpatient and outpatient synthetic datasets are well below. HALO and other baselines all achieve much lower NNAA risk.
Finally, our HALO model focuses on generating longitudinal EHR data, such as medical codes and lab results. However, other crucial data modalities, such as clinical notes and medical images, are not yet covered by the model. To generate fully comprehensive patient records that include all modalities, it will be necessary to use diverse training data and develop multiple models to handle each modality. This exciting avenue of research is a promising future direction.

CONCLUSION
In this paper, we proposed a new method, HALO, for generating high-dimensional synthetic longitudinal EHR data. Our method is specifically designed to handle the sequential, multi-granular, and extremely high-dimensional nature of electronic health records by generating an explicit probability distribution over the codes, visits, and records, and HALO can generate realistic data without needing to aggregate or remove any codes as past approaches have unanimously done. We then showed that HALO can produce incredibly realistic synthetic EHR data. Specifically, we showed that HALO can capture the probability distribution underlying the records better than other language model baselines and then produce a synthetic dataset that both looks similar to and offers the utility of real patient records, as measured by medical code occurrence probabilities and machine learning classification tasks augmented with synthetic data. Finally, we also showed through several privacy evaluations that our method offers this performance without compromising privacy.
In conclusion, one of the key advantages of HALO is its ability to generate binary sequences that are over a million variables in length. Its impressive performance makes it a promising avenue for developing and sharing realistic but synthetic EHR datasets that can support diverse applications. This represents an exciting opportunity to expand the use of synthetic data in the healthcare field and could help to address some of the challenges associated with data privacy and security.

DATA AVAILABILITY
While the outpatient EHR dataset is proprietary, the MIMIC-III inpatient EHR dataset [25] that we use is publicly available and may be downloaded and used freely after completing the required training and applying for access on PhysioNet.