CHeart: A Conditional Spatio-Temporal Generative Model for Cardiac Anatomy

Two key questions in cardiac image analysis are to assess the anatomy and motion of the heart from images, and to understand how they are associated with non-imaging clinical factors such as gender, age and disease. While the first question can often be addressed by image segmentation and motion tracking algorithms, our capability to model and answer the second question is still limited. In this work, we propose a novel conditional generative model to describe the 4D spatio-temporal anatomy of the heart and its interaction with non-imaging clinical factors. The clinical factors are integrated as conditions of the generative modelling, which allows us to investigate how these factors influence the cardiac anatomy. We evaluate the model performance on two main tasks: anatomical sequence completion and sequence generation. The model achieves high performance in anatomical sequence completion, comparable to or outperforming other state-of-the-art generative models. In terms of sequence generation, given clinical conditions, the model can generate realistic synthetic 4D sequential anatomies that share similar distributions with the real data. The code and the trained generative model are available at https://github.com/MengyunQ/CHeart.

Cardiac imaging plays an essential role in cardiovascular disease diagnosis and management [1]. Imaging modalities such as cine cardiac magnetic resonance (CMR) or ultrasound reveal the anatomical structure of the heart as well as its contracting and relaxing patterns [2]. A classical but long-standing research problem is to explore the associations between the three-dimensional (3D) cardiac anatomy and non-imaging clinical factors, such as age, gender and disease [3]. Besides 3D anatomical information, the temporal dynamic motion of the heart also contains information that is useful for clinical diagnosis and therapy selection [4]–[6]. It is of particular interest to develop computational tools that can bridge spatio-temporal imaging features and non-imaging clinical factors. In this work, we aim to improve our understanding of spatio-temporal cardiac anatomy and clinical factors from a generative modelling perspective. We propose a conditional generative model of the interaction between imaging features and clinical factors. Given clinical factors as conditions, the proposed model can generate corresponding 4D spatio-temporal cardiac anatomies. We demonstrate that the generated 4D anatomies are realistic and consistent with the real data distribution.
Lately, the field of conditional generative modelling has made tremendous progress, greatly driven by deep learning methods such as conditional generative adversarial networks (GANs) [7], conditional variational autoencoders (VAEs) [8], [9], flow-based models [10] and diffusion models [11]. These approaches enable efficient approximation of the underlying conditional distributions and generation of high-quality samples. Improvements in conditional generative models have been characterised by numerous developments in different generation tasks: image-to-image translation [12]–[14], style- and lyrics-to-music generation [15] and text-to-image synthesis [16].
Apart from generating static images [11], generative models have also been applied to sequential data, such as videos [17], [18] and music [15]. In these applications, it is important to learn a model that is able to capture the inner connection of temporal sequences. To this end, long short-term memory (LSTM) [19], [20] and transformers [21] have been explored to learn the sequential progression of the latent representations of the samples. Some work also introduces spatio-temporal convolution and attention layers to learn temporal world dynamics from a collection of videos [18]. Sequential data contain both structural variations and motion information. Disentangled representation learning approaches such as DiSCVAE [22] have been proposed to separate representations of the motion features from the structural features.
In the field of medical imaging, several papers have explored incorporating non-imaging clinical factors into the image generation process. Dalca et al. [23] proposed a learning framework for building deformable brain image templates conditioned on age. Xia et al. [24] developed a model to generate synthetic brain images conditioned on age and the status of Alzheimer's disease. For cardiac images, Biffi et al. [25] presented LVAE for interpretable classification of anatomical shapes into different clinical conditions. Krebs et al. [26] proposed to learn a probabilistic motion model for spatio-temporal cardiac image registration. Reynaud et al. [27] proposed a causal generative model to generate synthetic 3D ultrasound videos conditioned on a given input image and an expected ejection fraction. Campello et al. [28] proposed a conditional generative model in cardiac imaging to extract longitudinal patterns related to ageing. Duchateau et al. [29] built a scheme for synthesising pathological cardiac sequences from real healthy sequences. Amirrajab et al. [30] developed a framework for simulating cardiac MRI with variable anatomical and imaging characteristics. For cardiac temporal modelling, several works [31]–[33] showed that dynamic cardiac data can be described by low-dimensional latent representations, e.g. using a conditional autoencoder to capture latent representations of the data [33] or applying temporal smoothness as a regularisation term in the reconstruction loss function [32], [33]. These works provide useful insights for conditional medical image generation. However, the generation of a sequence of spatio-temporal cardiac anatomies from multiple clinical factors has been less explored.
In this work, we propose a conditional generative model that can generate realistic cardiac anatomical sequences conditioned on non-imaging factors including age, gender, weight, height and blood pressure. We name the conditional heart generation model CHeart. The model employs a variational autoencoder to learn the latent representations of cardiac anatomies and a condition encoder to embed the clinical conditions into a condition latent vector. A temporal module is then designed to generate the condition-related sequential latent space based on the anatomy latent representations and the condition latent vector. The proposed model demonstrates high diversity and fidelity in generation, evaluated using structural overlap and surface distance metrics, as well as clinical measure (ventricular volume and mass) distributions. The main contributions of this work are summarised as follows:
• We propose a spatio-temporal generative model for 3D cardiac anatomy that accounts for both the spatial variations and the temporal variations, i.e. motion, during the cardiac cycle.

II. METHODS
The proposed generative model takes non-imaging clinical factors as input and generates a cardiac anatomical sequence. Fig. 1 illustrates the overall framework. The following sections provide more technical details. First, we introduce the conditional generative model. Then, we describe the temporal module for learning the sequential latent representations due to cardiac motion. Lastly, we demonstrate two applications of the generative model at the inference stage: anatomical sequence completion and anatomical sequence generation.

A. Conditional generative model
Assume that we observe a dynamic sequence of anatomies of a subject, x_t (t = 0, 1, ..., T − 1), where x_t denotes the anatomical segmentation at the t-th frame and T denotes the total number of time frames in the sequence. We also observe some clinical conditions c for this subject, where c could include factors such as age, gender, weight, height and blood pressure. Our aim is to learn the probability distribution of the anatomy x conditioned on c with a chosen model, p_θ(x|c), where θ denotes the model parameters. We seek a model p_θ(x|c) that is sufficiently flexible to describe the data x. Deep neural networks are often used for this purpose due to their modelling capacity [8], [9], [34]. Without loss of generality, we first attempt to learn the distribution of the anatomy at the first time frame, p_θ(x_0|c), which is often the end-diastolic (ED) frame in cardiac imaging.
We adopt the conditional β-VAE model [8], [9], [34] to learn the data distribution. The condition c is embedded as a condition latent vector z_c by an MLP, which integrates multiple clinical factors and enables exploration across the condition latent space. The model consists of a decoder p_θ(x_0|z_0, z_c) and an encoder q_ϕ(z_0|x_0, z_c). The decoder p_θ(x_0|z_0, z_c) with parameters θ maps the latent variables z_0, z_c to the anatomy x_0. We assume a prior distribution p(z_0) over the latent variable z_0. The prior and the decoder together define a joint distribution, denoted as p_θ(x_0, z_0|z_c), parameterised by θ.
To turn the intractable posterior inference and learning problem into a tractable one, we introduce a parametric encoder model q_ϕ(z_0|x_0, z_c) with variational parameters ϕ, which approximates the true but intractable posterior distribution p_θ(z_0|x_0, z_c) of the generative model, given an input x_0 and condition code z_c:

q_ϕ(z_0|x_0, z_c) ≈ p_θ(z_0|x_0, z_c), (1)

where q_ϕ(z_0|x_0, z_c) often adopts a simpler form, e.g. a Gaussian distribution. By introducing the approximate posterior q_ϕ(z_0|x_0, z_c), the log-likelihood of p_θ(x_0|z_c) can be formulated as:

log p_θ(x_0|z_c) = E_{q_ϕ(z_0|x_0, z_c)}[log (p_θ(x_0, z_0|z_c) / q_ϕ(z_0|x_0, z_c))] + D_KL(q_ϕ(z_0|x_0, z_c) ∥ p_θ(z_0|x_0, z_c)), (2)

where the second term denotes the Kullback-Leibler (KL) divergence between q_ϕ(z_0|x_0, z_c) and p_θ(z_0|x_0, z_c). It is non-negative and zero only if the approximate posterior q_ϕ(z_0|x_0, z_c) equals the true posterior distribution p_θ(z_0|x_0, z_c). Due to the non-negativity of the KL divergence, the first term in Eq. 2 is a lower bound of the evidence log p_θ(x_0|z_c), known as the evidence lower bound (ELBO). Instead of optimising the evidence log p_θ(x_0|z_c), which is often intractable, we optimise the ELBO:

L_ELBO = E_{q_ϕ(z_0|x_0, z_c)}[log p_θ(x_0|z_0, z_c)] − D_KL(q_ϕ(z_0|x_0, z_c) ∥ p(z_0)). (3)

To better control the representation capacity of the encoding and encourage more efficient latent encoding, we adopt the β-VAE formulation, which modifies the VAE with an adjustable hyperparameter β [34]. As a result, the loss function of the generative model is formulated as:

L = −E_{q_ϕ(z_0|x_0, z_c)}[log p_θ(x_0|z_0, z_c)] + β D_KL(q_ϕ(z_0|x_0, z_c) ∥ p(z_0)), (4)

where the sign is negated so that we can minimise the loss function.
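For a Gaussian posterior and a standard normal prior, the KL term in Eq. 4 has a closed form and the reconstruction term reduces to a cross-entropy. The following is a minimal NumPy sketch of the loss, not the paper's actual implementation; the binary reconstruction target and the 32-dimensional latent size are illustrative assumptions:

```python
import numpy as np

def kl_std_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)

def beta_vae_loss(x, x_recon, mu, logvar, beta=0.001):
    """Negative ELBO of Eq. 4: reconstruction cross-entropy + beta * KL.
    x is a binary target, x_recon the predicted foreground probability."""
    eps = 1e-7
    recon = -np.sum(x * np.log(x_recon + eps)
                    + (1 - x) * np.log(1 - x_recon + eps))
    return recon + beta * kl_std_normal(mu, logvar)

# Toy usage: a 32-dimensional latent posterior that matches the prior.
mu, logvar = np.zeros(32), np.zeros(32)
x = np.array([1.0, 0.0, 1.0])
loss = beta_vae_loss(x, np.array([0.9, 0.1, 0.9]), mu, logvar)
```

A small β (the paper uses 0.001) weakens the KL penalty, trading off latent regularity against reconstruction accuracy.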
In practice, we use a reconstruction loss for the first term, i.e. how accurately the generative model can reconstruct the anatomy x_0 from the latent vector z_0 using the decoder. The reparameterisation trick is applied to remove the dependence of the expectation on q: the random variable z_0 ∼ q_ϕ(z_0|x_0, z_c) is expressed as a differentiable and invertible transformation of another random variable ϵ, so that the expectation no longer relies on q itself.
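For a Gaussian posterior, this transformation is z_0 = μ + σ ⊙ ϵ with ϵ ∼ N(0, I). A minimal NumPy sketch (purely illustrative; in the trained model this would operate on autograd tensors):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): the sampling noise is moved
    # into eps, so mu and logvar stay on a differentiable path.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
z0 = reparameterize(np.zeros(32), np.zeros(32), rng)  # one posterior sample
```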

Fig. 2. The temporal module for generating the sequential latent codes z^c_{0:T−1}, constructed with a one-to-many long short-term memory (LSTM) structure.
B. Temporal module

In the previous section, we modelled q_ϕ(z_0|x_0, z_c) and p_θ(x_0|z_0, z_c) for the first frame x_0 of a sequence. Here, to model the whole anatomical sequence x_0, x_1, ..., x_{T−1} conditioned on the clinical conditions c, we design a temporal module constructed as a one-to-many LSTM structure [35] with parameters ω, which generates the condition-related sequential latent codes based on z_0 and z_c. The detailed structure of the temporal module is illustrated in Fig. 2.
The LSTM [36] is a variant of recurrent neural networks that consists of gating mechanisms and cell memory blocks. The first LSTM cell of the module takes the concatenation of the anatomy latent representation z_0 and the condition latent representation z_c as input, denoted as z^c_0. With the hidden state h_0 and cell state cell_0 initialised to zero, it infers the latent code z^c_1 at the next time frame. Each following LSTM cell takes z^c_{t−1} as input, updates the hidden state h_t and cell state cell_t, and infers the latent code z^c_t. All the LSTM cells share weights. Each latent code z^c_t contains information about both the anatomy at time t and the clinical conditions c. The cardiac anatomy of a dynamic sequence thus forms a temporal trajectory z^c_t in the latent space, where t = 0, 1, ..., T − 1. After the temporal module computes the latent codes z^c_{0:T−1} across all time frames, the decoder generates the anatomical sequence x′_t from z^c_t, as illustrated in Fig. 1.
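The one-to-many rollout can be sketched as follows. The cell equations are the standard LSTM; the output projection V, the hidden size and the weight initialisation are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step; gates stacked in order: input, forget, output, candidate."""
    n = h.shape[0]
    gates = W @ x + U @ h + b
    i, f, o = (sigmoid(gates[k * n:(k + 1) * n]) for k in range(3))
    g = np.tanh(gates[3 * n:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def rollout(z0, zc, T, params):
    """One-to-many rollout: z^c_0 = [z0; zc] seeds the first cell, and each
    inferred latent z^c_t is fed back as the input of the next cell."""
    W, U, b, V = params              # V projects the hidden state to a latent code
    h = np.zeros(U.shape[1])
    c = np.zeros(U.shape[1])
    z = np.concatenate([z0, zc])     # z^c_0
    seq = [z]
    for _ in range(T - 1):
        h, c = lstm_cell(z, h, c, W, U, b)
        z = V @ h                    # infer z^c_t for the next frame
        seq.append(z)
    return np.stack(seq)             # shape (T, dim(z0) + dim(zc))

rng = np.random.default_rng(0)
d, n, T = 64, 128, 20                # latent dim 32 + 32; hidden size is a guess
params = (rng.normal(0, 0.1, (4 * n, d)), rng.normal(0, 0.1, (4 * n, n)),
          np.zeros(4 * n), rng.normal(0, 0.1, (d, n)))
codes = rollout(rng.standard_normal(32), rng.standard_normal(32), T, params)
```

In the paper's implementation this corresponds to a single PyTorch LSTMCell applied recurrently with shared weights.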
The overall loss function for modelling anatomical sequence generation is formulated based on Eq. 4:

L_seq = Σ_{t=0}^{T−1} −E_{q_ϕ}[log p_θ(x_t|z^c_t)] + β D_KL(q_ϕ(z_0|x_0, z_c) ∥ p(z_0)). (5)

The training loss function is composed of two parts: 1) the reconstruction accuracy at all time frames, where we use cross-entropy to evaluate the performance in reconstructing the segmentation maps; 2) the KL divergence term, penalising the discrepancy between the learned prior and posterior distributions. The whole training process is performed end-to-end, with the encoder, temporal module and decoder trained together. The VAE enables the model to learn a low-dimensional latent space that captures the underlying anatomical variations. By incorporating the temporal module, the model can effectively capture the temporal dynamics in the cardiac images, enabling the generation of anatomically consistent and coherent sequences over time.
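The two parts of the training loss can be sketched as follows; the array shapes, the 1e-7 smoothing constant and the 2D spatial layout are illustrative assumptions (the paper operates on 3D segmentation volumes):

```python
import numpy as np

def sequence_loss(labels, probs, mu, logvar, beta=0.001):
    """Cross-entropy over all T frames plus a beta-weighted KL term on z_0.
    labels: (T, H, W) integer segmentation maps;
    probs:  (T, C, H, W) softmax outputs of the decoder."""
    T = labels.shape[0]
    ce = 0.0
    for t in range(T):
        # Pick the predicted probability of the ground-truth class per voxel.
        p = np.take_along_axis(probs[t], labels[t][None], axis=0)
        ce -= np.sum(np.log(p + 1e-7))
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)
    return ce + beta * kl
```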

C. Inference
To demonstrate the performance of the proposed generative model at the inference stage, we carry out two benchmark tasks, namely anatomical sequence completion and anatomical sequence generation, as shown in the right panel of Fig. 1.
In anatomical sequence completion, the model is given the anatomy at the first time frame, x_0, and the clinical conditions c. It is asked to generate the remaining sequence of anatomies across the cardiac cycle. The model maps x_0 and c to their latent representations z_0 and z_c, predicts the sequential latent codes z^c_{0:T−1} through the temporal module, and finally reconstructs the full sequence of cardiac anatomies x′_{0:T−1} using the shared-weight decoder.
In anatomical sequence generation, the model is conditioned only on the clinical factors c and does not require any anatomy as input. Since the model has learnt the distribution of the anatomical latent variable z_0, we can draw a sample z_0 in the latent space from the standard Gaussian distribution N(0, I) and concatenate it with the clinical latent code z_c. We then provide the concatenated latent code z^c_0 to the temporal module to predict z^c_{0:T−1} and generate the full anatomical sequence x′_{0:T−1} using the decoder.
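This inference path can be sketched as below; `step` and `decode` are hypothetical placeholders standing in for the trained temporal module and decoder:

```python
import numpy as np

def generate_sequence(zc, step, decode, T=20, dim_z=32, seed=None):
    """Sequence generation: sample z_0 ~ N(0, I), concatenate with the
    condition code z_c, roll the temporal module forward for T frames,
    and decode each latent code into an anatomy."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([rng.standard_normal(dim_z), zc])  # z^c_0
    frames = [decode(z)]
    for _ in range(T - 1):
        z = step(z)                 # z^c_t -> z^c_{t+1}
        frames.append(decode(z))
    return np.stack(frames)

# Toy stand-ins: identity dynamics; the decoder returns a 4x4 "segmentation".
seq = generate_sequence(np.zeros(32), step=lambda z: z,
                        decode=lambda z: np.zeros((4, 4)), T=20)
```

Because z_0 is sampled anew each time, repeated calls with the same conditions yield distinct but condition-consistent sequences.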

D. Evaluation
To evaluate the conditional generative model, we use quantitative measures to assess the generated anatomy, as well as clinical measures to assess distribution similarity.
First, we employ the Dice coefficient, the Hausdorff distance (HD) and the average symmetric surface distance (ASSD), which compare the generated cardiac anatomy to the ground truth anatomy associated with the same clinical conditions.
Second, we derive five imaging phenotypes: the left ventricular myocardial mass (LVM), LV end-diastolic volume (LVEDV), LV end-systolic volume (LVESV), right ventricular end-diastolic volume (RVEDV) and RV end-systolic volume (RVESV). We evaluate the differences between generated data and real data with the same clinical conditions, denoted as d_phenotype. Furthermore, these phenotypes are closely associated with age and gender [3]. We calculate the distributions of the imaging phenotypes against age and gender, and compare the generated data to the real data. The comparison is illustrated qualitatively using density plots and quantitatively using the Kullback-Leibler (KL) divergence and the Wasserstein distance (WD). The KL divergence [37] is an information-theoretic measure of the similarity between two probability mass functions. Similarly, the WD [38] measures the distance between two probability distributions P and Q and can be computed as:

WD(P, Q) = inf_{γ ∈ Γ(P, Q)} E_{(u, v)∼γ}[∥u − v∥],

where Γ(P, Q) is the set of all joint distributions γ over u and v whose marginals are P and Q. The WD can be seen as the minimum work needed to transform one distribution into the other, where work is defined by the amount of mass that must be moved from u to v to transform P into Q and the distance it must be moved.
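Both kinds of measures are straightforward to compute. A minimal sketch using SciPy's 1-D Wasserstein distance; the phenotype values below are made-up numbers for illustration only:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def dice(a, b):
    # Dice overlap between two binary masks (1 = perfect agreement).
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.sum(a & b) / (np.sum(a) + np.sum(b))

# Illustrative phenotype samples (e.g. LVM in grams) from real and
# synthetic populations; distribution distance via 1-D Wasserstein.
real = np.array([30.0, 35.0, 40.0, 45.0])
synthetic = np.array([31.0, 36.0, 39.0, 46.0])
wd = wasserstein_distance(real, synthetic)
```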

III. EXPERIMENTS

A. Data sets
A short-axis 3D cardiac MR dataset of 1,383 subjects was used, acquired at Hammersmith Hospital, Imperial College London. Each cardiac cine image sequence comprises 20 time frames (T = 20) covering one complete cardiac cycle, with a spatial resolution of 1.25 mm × 1.25 mm × 2 mm. The temporal resolution ranges from 0.041 to 0.048 seconds per frame, accommodating variations in heart rate. The cardiac anatomy is described by an image segmentation map with four labels: background, the left ventricle (LV), the myocardium (Myo) and the right ventricle (RV). Ground truth segmentations at the end-diastolic (ED) and end-systolic (ES) frames were generated using a multi-atlas segmentation method [39], then quality controlled and manually corrected by an experienced cardiologist using ITK-SNAP [40]. A state-of-the-art nnU-Net model [41] was trained on the ED and ES segmentations and then deployed to all time frames to generate the 3D-t segmentations, followed by manual quality control. To eliminate the influence of image orientation on the generation, all 3D-t segmentations were rigidly aligned to a template space using MIRTK [42], [43] and cropped to a standard size of 128 × 128 × 64. In this way, the generative model focuses on learning subject-specific variations of the anatomy instead of image orientations.
In terms of demographic information, all subjects were healthy volunteers, with 775 females and 608 males, aged between 18 and 73 years, weighing between 33 and 131 kg, with heights between 142 and 195 cm and systolic blood pressure (SBP) between 79 and 183 mmHg. When incorporating the clinical information into the model, age was represented as a categorical factor with seven age groups at an interval of 10 years, from 10 to 80 years old. The dataset was randomly split into three subsets for training (n = 968), validation (n = 138) and testing (n = 277).
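The categorical age encoding can be implemented as a simple one-hot binning. A sketch under the assumption of a one-hot representation (the paper does not specify the exact encoding):

```python
def age_group(age):
    """Map age in years to one of seven 10-year groups,
    [10, 20), [20, 30), ..., [70, 80), returned as a one-hot vector.
    Ages outside the range are clamped to the nearest group."""
    idx = min(max((age - 10) // 10, 0), 6)
    onehot = [0] * 7
    onehot[idx] = 1
    return onehot
```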

B. Experimental setup

1) Implementation:
The model was implemented in PyTorch [44]. The encoder q_ϕ consisted of four 3D convolution layers, one flatten layer and one bottleneck layer, outputting the latent code z_0. The condition mapping network was constructed as an MLP, outputting the latent code z_c for input conditions c. A latent dimension of 32 was used for both z_0 and z_c, and 64 for the concatenated latent vector z^c_0. The decoder consisted of one flatten layer and four 3D transposed convolution layers. All convolution and transposed convolution layers in the encoder and the decoder used a kernel size of 4. The temporal module was built with a one-layer LSTMCell. The regularisation weight β in the β-VAE was set to 0.001. The model was trained using the Adam optimiser with a learning rate of 5 × 10^−4 and a batch size of 8. It was trained for 500 epochs with an early stopping criterion based on validation set performance. The training took 17 hours on an NVIDIA RTX A6000 GPU.
2) Baseline methods: Currently, there is no other existing work that performs conditional generation of 3D-t cardiac anatomies. For comparison, we implemented the following baseline generation methods developed in other application domains, extending them from 2D image generation to 3D-t data generation:

CGAN: A conditional version of the generative adversarial network (GAN) originally developed for MNIST images [7]. Note that this model can only perform cardiac sequence generation, not sequence completion.
CVAE: The conditional generative model CVAE [9], modified to adapt to this application. CVAE incorporates the conditions by concatenating them with the anatomies in both the encoder and the decoder.
CVAE-GAN: A conditional variational generative adversarial network proposed in [45]. It is a general learning framework that combines a VAE with a GAN for synthesising natural images in fine-grained categories.
PCA: Principal component analysis [46], a classical method for dimensionality reduction that aims to preserve as much of the variation in the data as possible using the principal components. Note that PCA is only used for sequence completion, not for sequence generation.

C. Sequence completion
A well-known challenge in generative modelling is the difficulty of evaluation, as we normally do not have access to the ground truth data distribution, e.g. the distribution of all possible cardiac anatomies in our case. Therefore, we adopt anatomical sequence completion as a surrogate task for evaluating model performance. The sequence completion experiments assess the ability to capture the sequential information given the first frame of a cardiac anatomy sequence. One example of sequence completion is shown in Fig. 3. It can be seen in the figure that the generated anatomies across time frames maintain the same heart structures as the ED frame and capture the temporal motion pattern through time, contracting first and then expanding.
The sequence completion accuracy is evaluated between the generated anatomy and the ground truth across the whole sequence in terms of the Dice metric, HD and ASSD for three structures: LV, Myo and RV. Table I reports the sequence completion accuracy of the proposed model and compares it to other generative models, including CVAE-GAN [45], CVAE [9] and PCA [46]. The proposed model achieves a good sequence completion accuracy, with an average Dice metric of 0.874, HD of 5.842 mm and ASSD of 1.462 mm, comparable to or outperforming the other three generative models on most metrics. In addition, we conducted evaluations at the basal, mid-cavity and apical slices. The proposed model achieved average Dice metrics of 0.929, 0.927 and 0.878 for the LV at the three locations, surpassing the corresponding metrics of the other three generative models.
We also performed paired Student's t-tests between the results generated by our method and those of the competing methods. The performance metrics of the proposed model marked with an asterisk in Table I were significantly better than those of the other methods at a p-value smaller than 0.05. On a different cardiac MR dataset, [47] reports average Dice metrics of 0.94, 0.88 and 0.90 for the LV, myocardium and RV, respectively, for inter-observer variability in manual cardiac image segmentation (Table 3 of [47]). The Dice metric of the proposed generative model is close to these values, which indicates its high performance and capability for anatomical sequence completion.

D. Sequence generation
Apart from the sequence completion task, we also perform anatomical sequence generation and evaluate how close the generated anatomical sequences are to the real data. In this experiment, we generate new synthetic anatomies of the heart by providing the clinical conditions as the only input to the model. Given the stochastic nature of VAE generation, multiple anatomical sequences can be generated for each set of input conditions. We draw 20 random samples from the Gaussian distribution of the latent vector and correspondingly generate 20 synthetic anatomical sequences for each input condition set.
We first compare the synthetic anatomies to the real anatomy with the same clinical conditions and evaluate the mean similarity and the best similarity across the 20 samples, in terms of the Dice metric, HD, ASSD and differences in clinical measures. This is similar to the random-average or random-best evaluation in other recent generation works in computer vision [48]. Table II shows that the proposed model achieves a reasonably good sequence generation accuracy, with a mean Dice metric of 0.713, HD of 10.940 mm and ASSD of 3.023 mm. We also report the best value of each measurement, with a markedly improved maximum Dice of 0.793, minimum HD of 8.166 mm and minimum ASSD of 2.049 mm. This suggests that the proposed method can capture a wide variation of anatomies and thus draw samples that are close to the real sample. When we compare the differences in clinical phenotypes, Table III shows that our model achieved lower measurement differences, with mean differences of 25.93 mL, 11.74 mL, 34.63 mL, 15.54 mL and 17.34 g and minimum differences of 6.87 mL, 3.54 mL, 6.88 mL, 5.12 mL and 2.95 g for LVEDV, LVESV, RVEDV, RVESV and LVM, respectively. The mean and best values indicate that our model achieves similar (Dice) or better (HD, ASSD, differences in clinical measures) sequence generation accuracy compared to the other methods. The best values of the metrics indicate the high fidelity of the proposed generative model, which refers to the degree to which the generated samples resemble the real ones [49], [50]. It is important to acknowledge that in anatomical sequence generation, the model is not expected to replicate existing anatomies. Instead, the model generates a plausible anatomy that fulfils certain conditions, which is then compared to a real anatomy with the same conditions.
Further, we visualised two examples of anatomical sequence generation in Fig. 4. For each example, we show five random synthetic samples which share the same clinical conditions as the real sample. The figure illustrates that the LV and RV structures look realistic and their shapes share a high similarity with the real anatomy. The contracting pattern of the ventricles and myocardium from the ED to the ES frame also looks realistic and similar to the real sample. This demonstrates that our model can capture the overall anatomy and temporal dynamics of the heart during generation. The five samples with the same conditions also present a certain degree of variation, which demonstrates the diversity of the synthetic data. This is due to the Gaussian sampling in the generation process and reflects the individual differences between two hearts even if they are of the same gender and age, which can be caused by genetic, environmental, lifestyle and many other factors that are not easily accounted for by the model.
To further evaluate the fidelity and diversity of the generated samples with respect to the real samples, we assess the distance between their distributions, conditioned on age, a common factor of interest in clinical research. In addition to quantitative assessments, we conducted qualitative comparisons by evaluating the distributions of the five clinical measures (LVM, LVEDV, LVESV, RVEDV and RVESV) for both real and synthetic anatomies against age, illustrated in Fig. 5. Compared to other methods, the synthetic data distributions from our model closely resemble the real distributions and cover the full variability of the real samples, as quantified in Table IV.

The proposed model encodes the anatomical and clinical information z^c_0 of the first frame (ED) and generates the latent vectors z^c_t for the following frames via the temporal module. We use a dimensionality reduction technique, t-distributed stochastic neighbour embedding (t-SNE) [51], to visualise the latent space z^c_t of the generated anatomical sequences, as shown in Fig. 6. The sequential latent codes z^c_{0:T−1} start at ED (t = 0) and move along a cyclic path in the latent space. This shows that the generative model can capture the temporal dynamics of the anatomy during the heartbeat, forming a cyclic pattern like a real heart [52]. The larger overlap between frames 9 to 18 shows that the variation of anatomies is smaller in the relaxation stage, which demonstrates the non-linear trajectories of cardiac motion. We plot one example of the anatomical sequence at time frames 0, 3, 4, 6, 9, 12, 15 and 18 in the figure. Through the time frames, the anatomies present first decreasing and then increasing LV volumes. The thickness of the Myo shows the opposite trend, consistent with the contraction and relaxation pattern of the heart [53].

F. Condition manipulation
With the conditional generative model, we are able to simulate the change of anatomy when certain conditions (e.g. age) change. Fig. 7(a) shows a series of generated anatomies during ageing, where the age condition increases while all the other conditions, as well as the latent vectors drawn from the Gaussian distribution, are fixed. The difference map comparing the aged anatomy to the anatomy at 10-20 years old shows subtle changes in the LV and RV structures. We further generate 200 random samples of the synthetic ageing anatomies and derive the clinical measures. Fig. 7(b) illustrates the longitudinal evolution of these measures, stratified by gender. We observe a longitudinally increasing trend in LVM during ageing and a decreasing trend in LVEDV, consistent with findings in the clinical literature [54] (Figure 3 of [54]). This demonstrates the potential of using the model for simulating anatomical data distributions. However, we need to be cautious in interpreting this result, as our training data is cross-sectional rather than longitudinal, and the mechanism of cardiac ageing is complex, confounded by more factors (genetics, lifestyle, etc.) than the five conditions used in this work.

IV. DISCUSSION
The proposed model is built upon a β-VAE for learning the latent space of the cardiac anatomy. It integrates a conditional branch to model the influence of multiple clinical factors on the generation process and uses a temporal module to model the temporal relationship of anatomical latent vectors during cardiac motion. The experiments demonstrate good performance in both the anatomical sequence completion and sequence generation tasks, qualitatively and quantitatively. The model enables condition manipulation for demonstrating the impact of clinical factors on anatomical shape variation. When using the common clinical measures (ventricular volumes and mass) for evaluation, the distribution of generated anatomies is close to the real data distribution, both visually (Fig. 5) and quantitatively (Table IV), which indicates both the fidelity and diversity of the generation. While the model performs well in generating anatomically coherent structures, further improvement can be made in achieving a closer similarity between the distribution of generated anatomies and the real data distribution. There is also potential for further exploration of the relationship between cardiac motion and clinical conditions.

We foresee several potential downstream tasks for the generative cardiac anatomy model, including discovering patterns in large datasets, facilitating out-of-distribution detection and generating synthetic data. First, by training a generative model on a large dataset of cardiac anatomies, the trained model can capture complex patterns and variations of the anatomy associated with different clinical factors. This knowledge can be valuable for understanding population-level characteristics, identifying risk factors and informing public health strategies. Second, by learning the distribution of normal cardiac anatomy and dynamics, the proposed model can identify patterns of a given anatomy that deviate from the norm, indicating potential anomalies that require further
investigation. More importantly, the proposed method is a conditional generative model, which means it can learn the norm specifically for certain conditions (e.g. a gender and age group) and evaluate deviation from the norm in a personalised manner. Third, the trained generative model can provide a large amount of synthetic data for other tasks. Synthetic data can be used for data augmentation when training machine learning models [55], for creating synthetic fair data to improve the fairness of prediction models [56], [57], or as digital anatomies for in-silico trials [58]. Diverse and realistic synthetic data will alleviate the data scarcity issue in the medical field, where real data are often limited or not easy to share. This includes the creation of synthetic data for privacy-preserving research [59], [60].
There are a few limitations to this work. The first is the high computational cost of learning spatio-temporal patterns from 4D data during training, even after cropping the images to 128 × 128 × 64 and using sequences of only 20 time frames. An interesting future direction is to reduce the computational complexity of handling high-dimensional, high-resolution medical imaging data. Second, we use a segmentation map as the representation of the anatomy so that the generative model can focus on learning anatomical variations rather than intensity image styles. Future explorations could extend to generating intensity images of the heart [30] or using a mesh representation of the anatomy [61], which may be computationally more efficient. Third, due to the challenge of curating large-scale longitudinal datasets with high spatial resolution, we train the generative model on a cross-sectional imaging dataset of mainly healthy volunteers. It would be interesting to extend this to longitudinal and clinical imaging cohorts with cardiac diseases in the future.

V. CONCLUSION
In this work, we propose a novel conditional generative model that is able to synthesise spatial-temporal cardiac anatomies given clinical factors as input. It demonstrates the feasibility of generating highly realistic synthetic 3D+t anatomies of the heart that capture both the anatomical variation and the motion of the heart. The work paves the way for further generative modelling research in cardiac imaging, such as incorporating disease types or representing the anatomy as meshes. It also has the potential to be applied to downstream tasks, such as performing data augmentation based on varied anatomies, building condition-specific atlases and performing biomechanical modelling of the heart.

Fig. 1. Overview of the CHeart model, including training and inference stages. During training, an encoder is applied to learn the latent representations z_c and z_0 for the clinical conditions c and the anatomy at the first time frame x_0. A temporal module models the trajectory z^c_{0:T-1} in the latent space across the temporal dimension, starting from the initial latent vectors z_c and z_0. The decoder then generates the 4D cardiac anatomy sequence x_{0:T-1} from the latent vectors on the temporal trajectory. The training process enables two inference mechanisms at test time: sequence completion and sequence generation. In sequence completion, the model is given x_0 and c, and generates the remaining sequence of anatomies in the cardiac cycle. In sequence generation, a random latent code z_0 sampled from the prior distribution, together with c, is given to the model and the temporal module to generate the latent vector sequence z^c_{0:T-1}, which is used to generate the synthetic cardiac anatomical sequence x'_{0:T-1}.
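The two inference mechanisms in Fig. 1, sequence completion and sequence generation, can be sketched as follows. The functions `encode`, `temporal_step` and `decode` below are toy stand-ins for the trained networks; all names, dimensions and arithmetic are illustrative, not the actual CHeart implementation:

```python
import random

# Toy stand-ins for the trained networks (the real model uses a learned
# encoder, temporal module and decoder).
def encode(x0, c):
    # Map the first-frame anatomy and conditions to an initial latent vector.
    return [xi + ci for xi, ci in zip(x0, c)]

def temporal_step(z, c):
    # Advance the latent trajectory by one time frame, conditioned on c.
    return [zi + 0.1 * ci for zi, ci in zip(z, c)]

def decode(z):
    # Map a latent vector back to an anatomy (identity placeholder).
    return list(z)

def rollout(z, c, T):
    """Decode the latent trajectory z, temporal_step(z, c), ... into T frames."""
    frames = [decode(z)]
    for _ in range(T - 1):
        z = temporal_step(z, c)
        frames.append(decode(z))
    return frames

def sequence_completion(x0, c, T=20):
    """Given the first frame x0 and conditions c, complete the cardiac cycle."""
    return rollout(encode(x0, c), c, T)

def sequence_generation(c, T=20, latent_dim=2):
    """Sample z0 from the standard-normal prior, then roll out a sequence."""
    z0 = [random.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return rollout(z0, c, T)
```

The two modes share the same rollout; they differ only in whether the initial latent vector comes from encoding a given first frame or from sampling the prior.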

Fig. 3. An example of sequence completion, arranged in two rows in left-to-right, top-to-bottom order. With the end-diastolic (ED) frame at time t = 0 and conditions c as input, the model generates the remaining anatomical sequence at time frames t = 1-19, shown within the gray box. The top row depicts the anatomies at time frames t = 0-9, and the bottom row at t = 10-19.

Fig. 4. Visualisation of synthetic anatomies (last five columns) generated by the model, compared to the real anatomy (first column) with the same clinical conditions (text annotation). The whole anatomical sequence is generated, but only the ED and ES frames are shown here. The first and second rows of each example show the ED and ES frames of the cardiac anatomical sequence.

Fig. 5. Distributions of clinical measures for real data and synthetic data. Each graph displays a kernel density plot of an imaging phenotype (LVM, LVEDV, LVESV, RVEDV, RVESV) against age. For each plot, the x-axis denotes age and the y-axis denotes the value of the imaging phenotype. Darker areas indicate regions where the data is more concentrated; lighter areas show regions where the data is sparser.

Fig. 7. (a) An example of the synthetic cardiac anatomy during ageing. The first and third rows show the cardiac anatomies at the end-diastolic (ED) and end-systolic (ES) frames. The second and fourth rows show the difference maps between the aged anatomies at 20-80 years old and the anatomy at 10-20 years old. (b) The simulated evolution of clinical measures (LVM, LVEDV, LVESV, RVEDV, RVESV), obtained by generating 200 samples of gender-specific ageing cardiac anatomies and plotting their mean measures with 95% confidence intervals.

TABLE I
THE SEQUENCE COMPLETION PERFORMANCE OF DIFFERENT MODELS IN TERMS OF DICE, HAUSDORFF DISTANCE (HD) AND AVERAGE SYMMETRIC SURFACE DISTANCE (ASSD). MEAN AND STANDARD DEVIATION ARE REPORTED. ASTERISKS INDICATE STATISTICAL SIGNIFICANCE (*: P ≤ 0.05) WHEN USING A PAIRED STUDENT'S t-TEST COMPARING THE PERFORMANCE OF THE PROPOSED METHOD TO OTHER METHODS.

TABLE II
COMPARISON OF SEQUENCE GENERATION PERFORMANCE BETWEEN CGAN, CVAE, CVAE-GAN AND THE PROPOSED MODEL, IN TERMS OF MEAN AND BEST DICE AND CONTOUR DISTANCE METRICS FOR THE AVERAGE PERFORMANCE OVER LV, RV AND MYO. THE BEST VALUE ACROSS 20 SAMPLES FOR DICE (MAXIMUM), HD (MINIMUM) AND ASSD (MINIMUM) IS REPORTED. ASTERISKS INDICATE STATISTICAL SIGNIFICANCE (*: P ≤ 0.05) WHEN USING A PAIRED STUDENT'S t-TEST COMPARING THE PERFORMANCE OF THE PROPOSED METHOD TO OTHER METHODS.