DeepMesh: Mesh-based Cardiac Motion Tracking using Deep Learning

3D motion estimation from cine cardiac magnetic resonance (CMR) images is important for the assessment of cardiac function and the diagnosis of cardiovascular diseases. Current state-of-the-art methods focus on estimating dense pixel-/voxel-wise motion fields in image space, which ignores the fact that motion estimation is only relevant and useful within the anatomical objects of interest, e.g., the heart. In this work, we model the heart as a 3D mesh consisting of epi- and endocardial surfaces. We propose a novel learning framework, DeepMesh, which propagates a template heart mesh to a subject space and estimates the 3D motion of the heart mesh from CMR images for individual subjects. In DeepMesh, the heart mesh of the end-diastolic frame of an individual subject is first reconstructed from the template mesh. Mesh-based 3D motion fields with respect to the end-diastolic frame are then estimated from 2D short- and long-axis CMR images. By developing a differentiable mesh-to-image rasterizer, DeepMesh is able to leverage 2D shape information from multiple anatomical views for 3D mesh reconstruction and mesh motion estimation. The proposed method estimates vertex-wise displacement and thus maintains vertex correspondences between time frames, which is important for the quantitative assessment of cardiac function across different subjects and populations. We evaluate DeepMesh on CMR images acquired from the UK Biobank. We focus on 3D motion estimation of the left ventricle in this work. Experimental results show that the proposed method quantitatively and qualitatively outperforms other image-based and mesh-based cardiac motion tracking methods.

Corresponding author: Qingjie Meng. Q. Meng is also with the School of Computer Science, University of Birmingham.
W. Bai is also with the Department of Brain Sciences, Imperial College London. D. P. O'Regan is with the MRC London Institute of Medical Sciences, Imperial College London, W12 0HS, UK (e-mail: declan.oregan@imperial.ac.uk).
D. Rueckert is also with Klinikum rechts der Isar, Technical University of Munich, Germany.
diagnosis of myocardial diseases [10,32]. Recent works utilize 3D surface meshes to represent the anatomy and assess the ventricular structure and function from meshes, e.g., quantifying pathological cardiac remodeling [22] or characterizing LV motion phenotypes [12]. However, it remains a challenging problem to estimate cardiac motion on meshes directly from images, in particular while keeping the same mesh structure and vertex correspondence. Most recent cardiac motion tracking approaches utilize cine CMR images to estimate a dense motion field which represents pixel-/voxel-wise deformation in the image space, e.g., [4,6,21,24,32,33,34,52,54]. Mapping the deformation from a pixel-/voxel-wise representation to a vertex-wise representation on a cardiac mesh is typically inefficient and can reduce the accuracy of motion estimation. Specifically, a 2D pixel-wise motion field only considers the motion of the heart within a single view plane and does not provide complete 3D motion information. Using post-processing steps to convert 3D voxel-wise motion fields to 3D vertex-wise displacements may impair motion estimation accuracy due to interpolation.
In this work, we propose a novel learning-based method, DeepMesh, for estimating 3D cardiac motion on the heart mesh from 2D cine CMR images. The proposed method propagates a single template mesh to individual subjects and estimates both in-plane and through-plane motion on meshes by integrating information from short-axis (SAX) and long-axis (LAX) view images. Specifically, DeepMesh first utilizes a template heart mesh containing the epi- and endocardial surfaces to reconstruct the mesh at the end-diastolic (ED) frame for an individual heart from the input ED frame multi-view images. By deforming this template mesh, the proposed approach maintains the same mesh structure at the ED frame for all subjects. Subsequently, the multi-view images at the ED and t-th frames are utilized to directly estimate the 3D motion on the mesh. The estimated mesh motion explicitly shows the 3D displacement of each vertex from the ED frame to the t-th frame, and thus is able to maintain mesh structure and vertex correspondences between time frames. A differentiable mesh-to-image rasterizer is introduced during training to generate 2D soft segmentations from the 3D mesh. By comparing predicted 2D soft segmentations with ground truth 2D segmentations, the differentiable rasterizer allows leveraging of 2D multi-view anatomical shape information for both 3D mesh reconstruction and motion estimation. During inference, our model generates a sequence of meshes, which characterise the heart motion across the cardiac cycle. In this work, we model the left ventricle as a 3D mesh consisting of epi- and endocardial surfaces and estimate the LV myocardial motion.
Contributions: This paper extends a preliminary version of the work presented at the MICCAI 2022 conference [23]. In addition to the work in [23], the main contributions in terms of methodology and evaluation are summarized as follows:
• We additionally introduce a template-based mesh reconstruction module. This module reconstructs the ED frame mesh of individual subjects from a cardiac template and therefore enables subsequent mesh-based motion tracking. With the mesh derived from the template, the proposed method is able to maintain the same number of vertices and faces across the cohort.
• We add a new regularization term to the motion estimation module in [23] and demonstrate that this leads to improved performance in motion tracking.
• We conduct a more thorough experimental analysis of the proposed method. We quantitatively and qualitatively evaluate the performance of mesh reconstruction and mesh-based motion tracking. We additionally compare the proposed method with two state-of-the-art motion tracking methods which use multi-view images [23,24]. Moreover, we perform an extensive ablation study with respect to anatomical views, loss combinations and hyper-parameter selections.

A. Image-based motion estimation
Many cardiac motion estimation methods, including conventional methods and deep learning-based methods, consider motion tracking within image space. They typically use image registration algorithms to estimate 2D pixel-wise or 3D voxel-wise motion fields.
1) Conventional methods: Image registration has been applied to cardiac motion estimation in previous works. For example, the free form deformation (FFD) method for non-rigid image registration [38] has been widely used for cardiac motion estimation in many recent works, e.g., [4,6,7,31,32,39,40,45]. De Craene et al. [11] introduced continuous spatio-temporal B-spline kernels for computing a 4D velocity field, which enforced temporal consistency in motion recovery. Thirion [44] developed the demons algorithm, which treats image matching as a diffusion process and has further been used for cardiac motion tracking. Based on this work, Vercauteren et al. [46] introduced a non-parametric diffeomorphic image registration method which has been used for cardiac motion tracking [34].
2) Deep learning-based methods: In recent years, deep convolutional neural networks (CNNs) have inspired the exploration of deep learning-based cardiac motion estimation approaches [14]. Qin et al. [33] proposed a joint deep learning network for simultaneous cardiac segmentation and motion estimation. Their method contains a shared feature encoder which enables weakly-supervised segmentation. The U-Net architecture [37] has been widely used for learning-based image registration [5,51] and further for cardiac motion estimation. For example, Zheng et al. [55] proposed a method for cardiac pathology classification based on cardiac motion. This method utilizes a modified U-Net to generate flow maps between the ED frame and any other frame. Balakrishnan et al. [5] used a 3D U-Net to build VoxelMorph for learning-based deformable image registration. Their registration method has been utilized in other cardiac motion tracking works, e.g., [43]. Different from most of these previous deep learning-based methods that aim at 2D motion tracking by only using SAX stacks, Meng et al. [24] focused on 3D motion tracking by fully combining multiple anatomical views. In their work, a deep learning model was proposed that learns 3D motion fields from a set of 2D SAX and LAX cine CMR images, which was able to estimate both in-plane and through-plane myocardial motion. Regarding cardiac motion tracking in multiple datasets, Yu et al. [54] considered the distribution mismatch problem and proposed a meta-learning-based online model adaption framework. Towards motion tracking in tagged MRI images, Ye et al. [52] proposed a deep learning model where the motion fields between any two consecutive frames are first computed, and then combined to estimate the Lagrangian motion field between the ED frame and any other frame. Our method aims at 3D cardiac motion tracking from 2D images of multiple anatomical views. In contrast to [24], which estimates 3D motion in image space, our method focuses on estimating 3D motion in mesh space.

B. Mesh-based motion estimation
In contrast to dense motion estimation in image space, several other methods focus on anatomical motion estimation in mesh space [48]. These approaches explore mesh matching or mesh registration to estimate the motion field of the mesh. For example, Papademetris et al. [28] proposed a method that uses a biomechanical modeling and shape-tracking approach to estimate the motion of the myocardial mesh. Pan et al. [27] built a 3D mesh to represent material points inside the left ventricle wall and extended the 2D harmonic phase (HARP) technique [25] to 3D for motion tracking of the mesh through a cardiac cycle. Abdelkhalek et al. [1] built a framework to compute mesh displacements via point cloud alignment. These mesh motion estimation approaches compute mesh motion fields only from dynamic shape information, without considering intensity information from images. In contrast, our method combines image information with the myocardial mesh which contains the epi- and endocardial surfaces of the heart. We estimate 3D motion fields on meshes by using the intensity information of 2D images from multiple anatomical views.

C. Mesh reconstruction
In practice, the 3D mesh of the heart is not always available. Reconstructing a 3D mesh from images has been well investigated in the computer vision literature. Conventional approaches are based on multi-view geometry [17]. Although they can obtain high-quality reconstructions, these approaches are limited by the coverage provided by the multiple views. More recently, deep learning-based approaches have become the major trend in 3D shape generation, and they can reconstruct 3D meshes from only a single image or a few images. Because of the difficulty of directly generating a feasible mesh structure, many of these approaches deform a template mesh to the target shape.

Fig. 1: An overview of the proposed method. Panel (a) describes the mesh reconstruction module, which reconstructs the ED frame mesh from a template mesh and the ED frame multi-view images. Panel (b) is the mesh motion estimation module, which takes multi-view images as input and learns a 3D mesh motion field ∆V_0→t. By updating the reconstructed ED frame mesh with ∆V_0→t, the mesh of the t-th frame is predicted. During training, a differentiable mesh-to-image rasterizer is introduced to extract different 2D anatomical view planes from the predicted 3D meshes, which generates 2D soft segmentations. By comparing the predicted soft segmentations with ground truth segmentations, the rasterizer enables leveraging 2D shape information for 3D mesh reconstruction and motion estimation. Losses of each module are shown in Fig. 2 and Fig. 4, respectively.
In medical imaging, 3D shape reconstruction of the heart has been studied in the literature. For example, Villard et al. [47] proposed a data fitting method for cardiac surface reconstruction from 2D cardiac contours. This method iteratively optimizes the surface smoothness term and the contour matching term to obtain the 3D mesh of the heart. However, this method obtains meshes without maintaining vertex correspondences across the cohort. Bello et al. [6] extracted heart surface meshes from image segmentations using the marching cubes algorithm. This method also does not maintain vertex correspondences. Romaszko et al. [36] proposed a deep neural network to predict point clouds of the heart from images. Following their work, Joyce et al. [19] proposed a mesh fitting method which iteratively optimizes shape parameters (e.g., scalars, orientations) in order to match a mesh to the input 2D segmentations. Xia et al. [50] proposed a method that uses CNNs for statistical shape modeling, in particular, adding phenotypic and demographic information for shape reconstruction. Their method estimates shape parameters and transformation parameters to deform the mean shape of the population for each subject. However, their method needs conventional registration algorithms to generate reference 3D shape information for model training, i.e., reference shape parameters and reference transformation parameters. Different from these previous works, we build a deep neural network that directly predicts the 3D surface mesh of the heart at the ED frame by deforming a cardiac template according to the input 2D multi-view cine CMR images. Our method is able to reconstruct corresponding heart meshes across different subjects, i.e., with a consistent number of vertices and faces.

III. METHOD
Given a set of CMR images, our goal is to propagate a single template mesh to all subjects and, for individual subjects, to estimate the heart motion on meshes across the cardiac cycle. Our task is formulated as follows: let {V_tpl, F} denote the template mesh, {I^sa_0, I^2ch_0, I^4ch_0} denote the 2D SAX, LAX 2-chamber (2CH) and LAX 4-chamber (4CH) view images of the heart at the ED frame, and {I^sa_t, I^2ch_t, I^4ch_t} denote the multi-view images at the t-th frame. V_tpl and F refer to the vertices and faces of the template mesh. T is the number of frames in the cardiac cycle and 0 ⩽ t ⩽ T − 1. We want to reconstruct the 3D heart mesh of individual subjects at the ED frame ({V̂_0, F}) from the template, and then, for individual subjects, to estimate a 3D mesh motion field ∆V_0→t between the ED and t-th frames by using the corresponding multi-view images. Here, ∆V_0→t represents the motion of each vertex from the ED frame to the t-th frame, {V_tpl, V̂_0, ∆V_0→t} ∈ R^{N×3} and N is the number of vertices.
The schematic architecture of the proposed method is shown in Fig. 1. The proposed method can be separated into two main components: First, a mesh reconstruction module reconstructs the 3D mesh of the heart at the ED frame for individual subjects by deforming the template mesh (shown as the red box in Fig. 1). Second, a mesh motion estimation module learns the motion of a myocardial mesh from multi-view intensity images and deforms the ED frame mesh to the t-th frame based on the learned 3D mesh motion field (shown as the blue box in Fig. 1). During model training, a differentiable mesh-to-image rasterizer is introduced to yield 2D segmentations of the myocardium in the corresponding 2D planes (in the SAX and LAX orientations) by rasterizing the estimated 3D myocardial mesh. This enables using 2D segmentation information to supervise the mesh reconstruction and motion estimation modules.

A. Mesh reconstruction
This module aims to reconstruct the myocardial mesh for individual subjects at the ED frame. In particular, we leverage multi-view input images to learn a displacement ∆V_tpl→0 that deforms the template mesh to the ED frame mesh of individual subjects vertex-by-vertex. The framework is shown in Fig. 2.
1) Deformation estimation: We estimate ∆V_tpl→0 from the input multi-view images of the ED frame. Specifically, a deformation network composed of a 2D CNN and a 3D CNN is introduced to learn an intermediate 3D voxel-wise displacement Φ_tpl→0 from the 2D input SAX and LAX view images. The diagram of the deformation network architecture is shown in Fig. 3 (a), where 2D convolutional layers learn 2D features from input images, followed by 3D convolutional layers that further learn 3D representations and predict Φ_tpl→0. Subsequently, a grid sampler is utilized to generate ∆V_tpl→0 from the obtained Φ_tpl→0. In detail, for each vertex of the input template, its displacement is sampled from Φ_tpl→0 by using bi-linear interpolation at the coordinates of this vertex. Therefore, ∆V_tpl→0 contains the displacement of each vertex from the template mesh to the ED frame mesh.
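The grid-sampling step above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the function name, the (D, H, W, 3) field layout, vertex coordinates given in voxel space, and the use of linear interpolation over the eight surrounding voxels are all illustrative assumptions.

```python
import numpy as np

def sample_vertex_displacements(phi, vertices):
    """Sample a per-vertex displacement from a voxel-wise displacement field.

    phi      : (D, H, W, 3) voxel-wise displacement field (assumed layout).
    vertices : (N, 3) vertex coordinates in voxel space, ordered (z, y, x).
    Returns an (N, 3) array: the field linearly interpolated at each vertex.
    """
    D, H, W, _ = phi.shape
    out = np.zeros((len(vertices), 3))
    for n, (z, y, x) in enumerate(np.asarray(vertices, dtype=float)):
        z0, y0, x0 = int(np.floor(z)), int(np.floor(y)), int(np.floor(x))
        dz, dy, dx = z - z0, y - y0, x - x0
        # Accumulate the 8 corner contributions (the weights sum to 1).
        for iz, wz in ((z0, 1 - dz), (min(z0 + 1, D - 1), dz)):
            for iy, wy in ((y0, 1 - dy), (min(y0 + 1, H - 1), dy)):
                for ix, wx in ((x0, 1 - dx), (min(x0 + 1, W - 1), dx)):
                    out[n] += wz * wy * wx * phi[iz, iy, ix]
    return out
```

In practice this corresponds to a differentiable grid-sample operation (e.g., PyTorch's `grid_sample`), so gradients can flow from vertex displacements back into the predicted field.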
We formulate the deformation estimation as

∆V_tpl→0 = S(Φ_tpl→0, V_tpl), (1)

where H_D(·) is the deformation network, S(·,·) is the grid sampler and Φ_tpl→0 = H_D(I^sa_0, I^2ch_0, I^4ch_0).

2) Reconstructing the ED frame mesh: With the estimated ∆V_tpl→0, the ED frame mesh ({V̂_0, F}) of an individual subject can be reconstructed by deforming the input template ({V_tpl, F}):

V̂_0 = V_tpl + ∆V_tpl→0. (2)

A Laplacian smoothing loss L^tpl→0_smooth is used to evaluate the smoothness of the reconstructed ED frame mesh. The Laplacian of a vertex v̂^i_0 is defined by

L(v̂^i_0) = v̂^i_0 − (1/|N_i|) Σ_{j∈N_i} v̂^j_0, (3)

where {v̂^i_0, v̂^j_0} are vertices of V̂_0 and N_i is the set of vertices adjacent to v̂^i_0. A surface loss L_surf penalizes the dissimilarity between the reconstructed mesh ({V̂_0, F}) and the ground truth mesh ({V_0, F}) of the ED frame. We use the Chamfer distance as the implementation:

L_surf = Σ_{v̂∈V̂_0} min_{v∈V_0} ‖v̂ − v‖²_2 + Σ_{v∈V_0} min_{v̂∈V̂_0} ‖v − v̂‖²_2. (4)

In addition, we utilize the Huber loss used in [24,33] as a regularization term to encourage a smooth intermediate Φ_tpl→0:

L^tpl→0_reg = (1/Q) Σ_{i=1}^{Q} sqrt(ϵ + ‖∇Φ_tpl→0(q_i)‖²_2). (5)

As in [24,33], ϵ is set to 0.01. q_i is the i-th voxel and Q denotes the number of voxels.
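The Laplacian smoothing term and the Chamfer surface loss can be illustrated with a small NumPy sketch. This assumes a uniform-weight Laplacian and a sum-reduced, squared-distance Chamfer distance; the paper's exact implementations (e.g., from a mesh library) may differ in weighting and reduction.

```python
import numpy as np

def laplacian_smoothing_loss(verts, adjacency):
    """Mean squared norm of the uniform Laplacian:
    L(v_i) = v_i - (1/|N_i|) * sum of v_i's neighbors."""
    lap = np.stack([verts[i] - verts[list(nbrs)].mean(axis=0)
                    for i, nbrs in enumerate(adjacency)])
    return float((lap ** 2).sum(axis=1).mean())

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    each point is matched to its nearest neighbor in the other set."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).sum() + d2.min(axis=0).sum())
```

Note that the Chamfer distance compares unordered point sets, so it supervises the surface position without requiring vertex correspondences with the ground truth mesh.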
As we aim to learn a 3D dense deformation from 2D sparse images for mesh reconstruction, the above losses alone have difficulty guaranteeing accurate performance. To address this problem, we introduce a shape constraint from 2D segmentations as an additional regularization. This regularization term L^tpl→0_shape is described in detail in Sec. III-C.

B. Mesh motion estimation
In this module, we take multi-view images of the ED frame and the t-th frame as input to estimate a vertex-wise 3D mesh motion field ∆V_0→t. Then, we predict the mesh at the t-th frame by deforming the ED frame mesh reconstructed in the previous module using the 3D motion field ∆V_0→t. Fig. 4 shows the overview of this module.

1) Motion estimation:
We estimate ∆V_0→t from the input images by predicting an intermediate voxel-wise 3D motion field Φ_0→t. In detail, we build a motion network which consists of a 2D CNN and a 3D CNN to first learn Φ_0→t. This motion network combines 2D multi-view images at both the ED frame and the t-th frame to estimate the intermediate 3D voxel-wise motion field Φ_0→t. The diagram of the motion network architecture is shown in Fig. 3 (b), where 2D convolutional layers learn 2D features from the two time frames and 3D convolutional layers predict Φ_0→t. The obtained Φ_0→t represents the motion of image voxels from the ED frame to the t-th frame. Then, a grid sampler is utilized to generate ∆V_0→t from the obtained Φ_0→t based on the vertices of the reconstructed ED frame mesh (V̂_0) and bi-linear interpolation. ∆V_0→t represents the motion of each vertex from the ED frame to the t-th frame. Overall, ∆V_0→t is estimated from the input multi-view images by

∆V_0→t = S(Φ_0→t, V̂_0), (6)

where H_M(·,·) is the motion network and Φ_0→t = H_M({I^sa_0, I^2ch_0, I^4ch_0}, {I^sa_t, I^2ch_t, I^4ch_t}).
2) Mesh prediction: With the estimated ∆V_0→t, the reconstructed ED frame mesh ({V̂_0, F}) can be deformed to the t-th frame ({V̂_t, F}) by

V̂_t = V̂_0 + ∆V_0→t. (7)

As ground truth mesh displacement is usually unavailable, ∆V_0→t cannot be directly evaluated. Instead, we evaluate Φ_0→t in a self-supervised manner. We transform the SAX stack of the t-th frame (I^sa_t) to the ED frame using Φ_0→t via a spatial transformer network [18]. By minimizing the image similarity loss in Eq. 8, Φ_0→t is encouraged to reflect the motion of the myocardium.
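The self-supervised check above can be illustrated with a 2D analogue: backward-warp an image with a flow field and score the result against the fixed image. This is a sketch under assumptions (the actual module warps the 3D SAX stack with a spatial transformer, and the paper's similarity loss in Eq. 8 is not reproduced here; mean squared error is used as a stand-in).

```python
import numpy as np

def warp_image_2d(img, flow):
    """Backward-warp a 2D image: out[y, x] = img[y + flow_y, x + flow_x],
    sampled with bilinear interpolation (out-of-range samples read as 0)."""
    H, W = img.shape
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            sy, sx = y + flow[y, x, 0], x + flow[y, x, 1]
            y0, x0 = int(np.floor(sy)), int(np.floor(sx))
            dy, dx = sy - y0, sx - x0
            for iy, wy in ((y0, 1 - dy), (y0 + 1, dy)):
                for ix, wx in ((x0, 1 - dx), (x0 + 1, dx)):
                    if 0 <= iy < H and 0 <= ix < W:
                        out[y, x] += wy * wx * img[iy, ix]
    return out

def similarity_loss(fixed, warped):
    """Mean squared error between the fixed and the warped image (stand-in
    for the image similarity loss that supervises the motion field)."""
    return float(((fixed - warped) ** 2).mean())
```

Minimizing such a loss pushes the intermediate voxel-wise field to explain the observed intensity changes between the two frames, without any ground truth displacement.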
Similar to Eq. 3, the smoothness of the predicted t-th frame mesh is evaluated by a Laplacian smoothing loss L^0→t_smooth. The gradients of the intermediate Φ_0→t are penalized by a Huber loss similar to Eq. 5. For mesh motion estimation, we also introduce a shape constraint to better learn 3D dense deformation from 2D sparse images. This regularization term (L^0→t_shape) is described in detail in Sec. III-C.

C. Differentiable mesh-to-image rasterizer
As ground truth 3D deformation is usually unavailable, we want to use 2D anatomical shape information to further supervise both 3D mesh reconstruction and motion estimation. To achieve this, we propose a differentiable mesh-to-image rasterizer to extract 2D soft contours of the myocardium from the predicted 3D heart mesh at the ED frame and the t-th frame. By comparing with the ground truth 2D myocardial contours, the differentiable rasterizer enables using sparse 2D shape information from multiple views to supervise 3D mesh reconstruction and motion estimation.
The input of the differentiable rasterizer is the predicted 3D mesh of the myocardium {V̂_s, F}. The outputs are 2D contours of the myocardium intersected on the SAX, 2CH and 4CH view planes ({P^sa_s, P^2ch_s, P^4ch_s}). Here, s = {0, t} refers to the ED frame and the t-th frame, respectively. When extracting a 2D plane from the 3D mesh, the vertices of the 3D mesh may not lie perfectly in the 2D plane. Therefore, we compute the probability of vertices lying on the plane, which is important for maintaining differentiability. Specifically, we use probability maps to represent the 2D soft contours of the myocardium. Each pixel of the probability map represents the probability of a vertex from the 3D myocardial mesh lying on a specific 2D plane. The closer a vertex is to a plane, the higher the probability that it lies on the plane. Fig. 5 illustrates the rasterizer.
In detail, the coordinates of a vertex v̂^i_s (v̂^i_s ∈ V̂_s, i = [0, 1, ..., N]) are first transformed to the image space of the different anatomical planes using the relative position information in the DICOM header of the 2D images, e.g., (x^ik_s, y^ik_s, z^ik_s) are the transformed coordinates of v̂^i_s for a target 2D plane k. Then, the probability of each vertex lying on plane k is estimated according to its distance to the plane (Eq. 9). Here, p^ik_s refers to the probability of v̂^i_s belonging to plane k and τ is the hyper-parameter which controls the sharpness of the exponential function. d^ik_s is the distance between v̂^i_s and plane k, and z_k is the slice corresponding to plane k. The vertices satisfying d^ik_s < 1 are selected as the intersection of the 3D mesh {V̂_s, F} and the 2D plane k. The probability values of these vertices form the probability map P^k_s. The obtained 2D probability maps are compared to 2D ground truth binary segmentations {B^sa_s, B^2ch_s, B^4ch_s}. Here, only ground truth contours of the myocardium are used and we compare between contours. We utilize a weighted Hausdorff distance (WHD(·,·)) [35] to measure the similarity between these contours. L^tpl→0_shape is the shape regularization term for the mesh reconstruction module and L^0→t_shape is the shape regularization term for the mesh motion tracking module; the latter has the same form as Eq. 10 but with inputs P^k_t and B^k_t. As we use an exponential function (Eq. 9) for the rasterization, when minimizing the loss function (e.g., Eq. 10), the gradient can be back-propagated to train the networks. Therefore, the exponential function enables the differentiability of the rasterization, and thus end-to-end model training.
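The soft vertex-selection step can be sketched as follows. The squared-exponential score `exp(-d^2 / tau)` is an assumed form of the paper's Eq. 9 (the excerpt does not reproduce the exact expression); the function name and the distance-threshold default are also illustrative.

```python
import numpy as np

def soft_plane_probabilities(verts_z, z_plane, tau=3.0, thresh=1.0):
    """Soft membership of mesh vertices to one image plane.

    verts_z : (N,) vertex coordinates along the slice normal (image space).
    Returns the indices of vertices with distance < thresh and their
    probabilities exp(-d^2 / tau) -- a smooth, differentiable score that
    decays with the vertex-to-plane distance d.
    """
    d = np.abs(np.asarray(verts_z, dtype=float) - z_plane)
    keep = np.where(d < thresh)[0]
    return keep, np.exp(-d[keep] ** 2 / tau)
```

Because the score is a smooth function of the vertex coordinates (rather than a hard 0/1 selection), gradients of a contour loss on the probability map can propagate back to the vertex positions.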

D. Optimization
Our model is trained in two stages. The first stage trains the mesh reconstruction module (i.e., the deformation network H_D(·)) by minimizing L_recon (Eq. 11). The inputs are the template mesh and the multi-view images at the ED frame. The output is the vertex-wise displacement which deforms the template mesh to the individual subject.
The second stage trains the mesh motion estimation module (i.e., the motion network H_M(·)) by minimizing L_motion (Eq. 12). The inputs are the multi-view images of the ED frame and frame t. The output is the mesh motion field. For each training iteration, frame t is randomly selected from the cardiac cycle.
Here, {λ_i, β_i, γ_i}, i = {1, 2}, are hyper-parameters chosen experimentally depending on the dataset. We use the Adam optimizer (learning rate = 10^−4) to update the network parameters. Our model is implemented in PyTorch and is trained on an NVIDIA RTX A5000 GPU with 24GB of memory.
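As a sketch, the first-stage objective could be assembled as a weighted sum of the four reconstruction terms. This is an assumption: the excerpt does not spell out which of λ_1, β_1, γ_1 multiplies which term, so the pairing below (and the unweighted surface term) is purely illustrative.

```python
def recon_loss(l_surf, l_shape, l_smooth, l_reg,
               lam=20.0, beta=0.5, gamma=0.5):
    # Hypothetical weighting of the reconstruction losses using the
    # hyper-parameter values selected in the experiments (lambda_1 = 20,
    # beta_1 = 0.5, gamma_1 = 0.5). The term-to-weight pairing is assumed.
    return l_surf + lam * l_shape + beta * l_smooth + gamma * l_reg
```

The second-stage objective L_motion would be assembled analogously from the image similarity, shape, smoothing and regularization terms with λ_2, β_2, γ_2.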

IV. EXPERIMENTS
We evaluate the performance of 3D mesh reconstruction and mesh motion tracking on the LV myocardium. We compare the proposed method, named DeepMesh, with other image-based and mesh-based cardiac motion tracking methods. We explore the effectiveness of different loss components and the influence of the hyper-parameters. We show the key results in the main paper. The dynamic motion tracking videos can be found at https://github.com/qmeng99/DeepMesh.

A. Experiment setups

1) Data: Experiments were performed on 530 randomly selected subjects from the UK Biobank study [30]. Each subject has SAX, 2CH and 4CH view cine CMR sequences and each sequence contains 50 frames. SAX view images were resampled by linear interpolation from a spacing of ∼1.8 × 1.8 × 10 mm to a spacing of 1.25 × 1.25 × 2 mm, while 2CH and 4CH view images were resampled from ∼1.8 × 1.8 mm to 1.25 × 1.25 mm. Based on the center of the intersecting line between the middle slice of the SAX stack and the LAX view images, the SAX, 2CH and 4CH view images are cropped to cover the whole LV in the center. The input LV template mesh is provided by [3]. This template contains 22,043 vertices and 43,840 faces. For model training, 2D segmentations are used to supervise mesh reconstruction and motion tracking. The 2D binary segmentations used in Eq. 10 were extracted from a 3D high-resolution segmentation, which is generated via an automated tool provided in [13], followed by manual quality control. We use 3D myocardial meshes of the ED frame and the end-systolic (ES) frame for evaluation. These ground truth 3D meshes are reconstructed from the 3D high-resolution segmentations using the marching cubes algorithm. We split the dataset into 400/50/80 subjects for train/validation/test and train the proposed model for 300 epochs. We choose the hyper-parameters using grid search and select those with the best performance on the validation data. Specifically, the hyper-parameters in Eq. 11 are chosen from λ_1 ∈ {10, 20, 30, 40, 50}, β_1 ∈ {0.1, 0.3, 0.5, 0.7, 0.9} and γ_1 ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, and are selected as λ_1 = 20, β_1 = 0.5, γ_1 = 0.5. In Eq. 12, the hyper-parameters are chosen from λ_2 ∈ {100, 130, 150, 170, 190}, β_2 ∈ {10, 20, 30, 40, 50} and γ_2 ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, and are selected as λ_2 = 150, β_2 = 20, γ_2 = 0.5. In Eq. 9, we select τ = 3 from τ ∈ {2, 3}.
2) Evaluation metrics: To evaluate the performance of 3D motion tracking on meshes, we compare the predicted 3D mesh and the ground truth 3D mesh at the ES frame. In addition, we extract 2D contours of the myocardium on SAX and LAX view planes from the predicted 3D meshes, and then compare the extracted 2D contours with the ground truth 2D contours (extracted from the ground truth 3D meshes). The following metrics are used for evaluation: surface distance, Hausdorff distance (HD) and boundary F-score (BoundF). The surface distance evaluates the distance between the predicted and ground truth surfaces. The Hausdorff distance quantifies the contour distance while the boundary F-score evaluates contour alignment accuracy, as described in [8,16,29]. Here, to compute the Hausdorff distance for the SAX view, we average the Hausdorff distances of the second slice (slice 1), the middle slice (slice 4) and the second last slice (slice 7). (Code is available at DOI: 10.5281/zenodo.8200635.)
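The Hausdorff distance between two contours can be sketched as follows for finite point sets. This is a simplified, exact symmetric Hausdorff distance on sampled contour points; practical implementations may rasterize contours or use percentile variants.

```python
import numpy as np

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between 2D contour point sets
    a (N, 2) and b (M, 2): the largest nearest-neighbor distance
    in either direction."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

Unlike a mean surface distance, the Hausdorff distance is driven by the single worst-matched point, so it is sensitive to local outliers in the predicted contour.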
3) Baseline methods: We compared the proposed method with five state-of-the-art cardiac motion tracking approaches, including two conventional methods and three learning-based methods. The two conventional methods are a B-spline free form deformation (FFD) algorithm [38] and a diffeomorphic Demons (dDemons) algorithm [46], which have been used in many recent cardiac motion tracking works [4,6,31,32,34,45]. For the learning-based methods, the U-Net architecture has been used in many recent works for image registration [5,43,51], and thus our third baseline is a deep learning method with a 3D-UNet [9]. In addition, we compared the proposed method with MulViMotion [24] and MeshMotion [23], which are two deep learning-based methods that utilize multi-view cardiac CMR images for 3D motion tracking. For a fair comparison, we evaluated several sets of hyper-parameter values for all methods and selected the hyper-parameters that achieve the best Hausdorff distance on the validation set.

B. Mesh-based motion tracking

1) Mesh reconstruction performance:
The proposed method first reconstructs the mesh of the ED frame for each test subject. Fig. 6 (a) shows that the reconstructed mesh fits the ground truth mesh for a sample case. We extracted SAX, 2CH and 4CH view planes from the reconstructed ED frame mesh and generated 2D segmentations on the different view planes. Fig. 6 (b) and Table I qualitatively and quantitatively show the effectiveness of the mesh reconstruction by comparing the generated and the ground truth 2D myocardial contours.
2) Mesh motion estimation performance: Following mesh reconstruction, the proposed method estimates mesh motion fields across the full cardiac cycle. For each test subject, with the obtained vertex-wise motion fields {∆V_0→t | t = 0, ..., 49}, the reconstructed ED frame mesh is deformed to the t-th frame. The red meshes in Fig. 7 show that the estimated mesh motion field ∆V_0→t enables 3D myocardial motion tracking on meshes. In addition, we extracted SAX/2CH/4CH view planes from the predicted t-th frame mesh and generated the predicted 2D myocardium contours on the different view planes. Fig. 7 shows the effectiveness of ∆V_0→t by comparing the predicted and the ground truth 2D myocardium contours.

Fig. 7: Examples of motion tracking results. The reconstructed ED frame mesh is deformed to the t-th frame using the estimated 3D mesh motion fields. 2D myocardium contours on SAX, 2CH and 4CH view planes (Rows 2-4) are generated by extracting the corresponding planes from the predicted t-th frame mesh. Red contours are predicted results while green contours are ground truth.
3) Comparison study: We compare the proposed method with the baseline methods on motion estimation across the cardiac cycle. Fig. 8 demonstrates that MulViMotion [24], MeshMotion [23] and the proposed method are able to estimate both in-plane and through-plane motion, while the other methods only show motion within the SAX plane. This is because [23,24] and our method take full advantage of both SAX and LAX view images. Different from MulViMotion [24], which estimates a voxel-wise motion field and generates 3D meshes from segmentations, the proposed method directly estimates the motion of each vertex on the heart mesh, and thus is able to keep the number of vertices and the vertex correspondences across the cardiac cycle. In contrast to MeshMotion [23], where the ED frame mesh of an individual heart is needed before motion tracking, the proposed method directly reconstructs the ED frame mesh by propagating a template mesh. It integrates mesh reconstruction and mesh tracking into a single framework and also ensures the consistency of the meshes across different subjects. In addition, compared to [23], we add a regularization loss L^0→t_reg in this work to penalize the smoothness of the intermediate dense motion field (Φ_0→t). The results show that the proposed method achieves a smoother LV basal part than [23], e.g., at frames t = 20 and t = 40 in Fig. 8.
We further compare the different methods by estimating the 3D motion field from the ED frame to the ES frame, which shows the largest deformation. Table II shows the quantitative comparison results and Fig. 9 shows the qualitative results. From Table II, we observe that the proposed method outperforms all baseline methods and achieves the best performance regarding SAX, 2CH and 4CH view segmentations. In addition, the proposed method obtains the ES frame mesh which is most similar to the ground truth ES frame mesh in Fig. 9. These results demonstrate the effectiveness of the proposed method for estimating 3D mesh motion fields.
4) Ablation study: For the proposed method, we explore the effects of using different anatomical views and loss combinations in mesh reconstruction and mesh motion estimation. We utilize the Hausdorff Distance (HD) for evaluation. Table III and Table IV show that adding the LAX view images improves the performance. This might be because LAX views introduce high-resolution through-plane information for 3D motion estimation. These tables also show that the proposed method with all the losses performs best in both mesh reconstruction and motion tracking, which illustrates the importance of each loss component.

5) The influence of hyper-parameters: We evaluate the performance of mesh reconstruction and mesh motion estimation under various values of the hyper-parameters. Specifically, we compute the Hausdorff distance (HD) between the predicted and ground truth 2D myocardium contours on the SAX, 2CH and 4CH view planes. We compare the contours of the ED frame for mesh reconstruction and the contours of the ES frame for mesh motion estimation. Fig. 10 shows that, in contrast to the LAX views, the performance on the SAX view is not sensitive to the hyper-parameters. This might be because the SAX stacks contain multiple slices, while the 2CH and 4CH views only have a single slice for evaluation. From the last row in Fig. 10 (a), we observe that either a weak or a strong regularization on the voxel-wise displacement may reduce the accuracy of mesh reconstruction.

Fig. 8: Motion tracking results across the cardiac cycle (t = 0, 10, 20, 30, 40) using the baseline methods (FFD [38], dDemons [46], 3D-UNet [9], MulViMotion [24], MeshMotion [23]) and the proposed method (DeepMesh).
V. DISCUSSION
In the mesh motion estimation framework presented in this work, we predict the motion field of the heart mesh by sampling from an intermediate voxel-wise 3D motion field. An alternative would be to estimate mesh motion fields directly from the input images via fully connected layers, without intermediate voxel-wise 3D motion estimation. However, using fully connected layers to estimate the displacement of ∼20K vertices requires a large amount of GPU memory, which may not always be available.
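The per-vertex sampling step amounts to trilinear interpolation of the dense field at each vertex position (in a deep learning framework this is typically done with a differentiable interpolation op such as PyTorch's grid_sample). The NumPy sketch below shows the underlying computation, assuming vertex coordinates are already expressed in voxel units; the function name is illustrative:

```python
import numpy as np

def sample_vertex_displacements(phi, verts):
    """Trilinearly sample a dense displacement field at mesh vertex positions.

    phi:   (D, H, W, 3) voxel-wise displacement field.
    verts: (N, 3) vertex coordinates in voxel units, ordered (z, y, x).
    Returns an (N, 3) array of per-vertex displacements.
    """
    D, H, W, _ = phi.shape
    # Clamp so the 8 neighbouring voxels of each vertex stay inside the grid.
    v = np.clip(verts, 0, [D - 1 - 1e-6, H - 1 - 1e-6, W - 1 - 1e-6])
    lo = np.floor(v).astype(int)   # lower corner of the enclosing cell
    f = v - lo                     # fractional position inside the cell
    out = np.zeros_like(v, dtype=float)
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                # Trilinear weight of this corner for every vertex.
                w = (np.where(dz, f[:, 0], 1 - f[:, 0])
                     * np.where(dy, f[:, 1], 1 - f[:, 1])
                     * np.where(dx, f[:, 2], 1 - f[:, 2]))
                out += w[:, None] * phi[lo[:, 0] + dz, lo[:, 1] + dy, lo[:, 2] + dx]
    return out
```

Because interpolation is a weighted sum of field values, gradients can flow from the per-vertex losses back into the dense field during training.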
We use the weighted Hausdorff Distance to compare the extracted 2D contours and the ground truth 2D contours of the myocardium in L tpl→0 shape and L 0→t shape. Other boundary similarity measures that can evaluate the distance between soft-labeled and hard-labeled point sets may also be applied to this loss component. When evaluating motion estimation, we quantitatively evaluate the performance on the ES frame. This is because 3D ground truth meshes are only available at the ED and ES frames in our current dataset. More importantly, the ES frame has the largest deformation from the ED frame, which is the most challenging case in motion estimation. Besides, using the ES frame for quantitative evaluation is consistent with previous works, such as [33, 34, 53].
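For reference, the standard (hard) Hausdorff distance between two contour point sets, as used in our evaluation, can be sketched as follows; the weighted variant used in the shape losses additionally weights the rasterized points by their soft-contour probabilities, which is not shown here:

```python
import numpy as np

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two 2D point sets.

    a: (N, 2) array of contour points; b: (M, 2) array of contour points.
    Returns the larger of the two directed Hausdorff distances.
    """
    # Pairwise Euclidean distances between every point in a and every point in b.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Directed distance a->b: worst-case nearest-neighbour distance, and vice versa.
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

A lower HD indicates that the predicted contour lies closer to the ground truth contour everywhere along its length.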
We train the mesh reconstruction module and the mesh motion estimation module separately, although the proposed method is end-to-end trainable. The probability map (2D soft contours) obtained from the differentiable mesh-to-image rasterizer enables the differentiability of the rasterization. However, simultaneously training mesh reconstruction and mesh motion estimation may increase the complexity of hyper-parameter tuning. The proposed deep neural network in the mesh reconstruction module focuses on deforming the template mesh to the ED frame mesh of individual subjects. To move the template mesh to the individual subject space before mesh reconstruction, we utilize the relative position information in the DICOM headers of the 2D images. Fig. 11 shows an example of moving the template mesh to a subject space during data pre-processing.
Our evaluation has been conducted on LV myocardial motion tracking because it is important for the clinical assessment of cardiac function. However, the proposed method is not limited to the LV myocardium. Our model can be easily adapted to 3D right ventricular myocardial motion tracking by using the corresponding template mesh and ground truth 2D contours during training.
Table III and Table IV show that using the shape regularization alone (L tpl→0 shape and L 0→t shape) achieves the second best quantitative results. However, Fig. 12 demonstrates that shape regularization alone is insufficient for good qualitative results, while the other regularization terms contribute to surface smoothness, deformation accuracy and deformation smoothness, respectively.
The proposed method is trained and evaluated on healthy subjects, on which we aim to demonstrate the effectiveness of the methodology. We acknowledge that the currently trained model may not achieve the best performance on pathological data, especially hearts with specific diseases. To address this limitation, one possible solution is to include more pathological cases in the training set and re-train the model. In addition, there can be large deformations between the template mesh and pathological hearts, for which we may need to add extra regularization terms to the template-based mesh reconstruction module.
We believe that our mesh-based motion tracking method can benefit a variety of clinical applications. The proposed model provides an accurate and holistic estimation of 3D geometry and motion, which could be used for clinical prediction and association tasks where conventional metrics provide weak discrimination. For example, we can model the association between cardiac motion (either globally or vertex-wise) and demographics (e.g., age, gender), genetic predisposition, and disease risk factors. In particular, as our method maintains an anatomical correspondence of the cardiac meshes (i.e., the number of vertices and faces) across the cohort, it can facilitate learning complex motion features for specific tasks in a population. This could enable the use of motion-related traits for early diagnosis or monitoring of disease progression. In addition, our method can support biophysical modeling by providing meshes as input for mechanical simulations, which can potentially improve our understanding of cardiac physiology. Also, the predicted sequence of meshes can be used to compute conventional volumetric and functional biomarkers (e.g., ED volume, ejection fraction). However, in many existing clinical studies, volumetric biomarkers are computed from segmentations. To gain clinical acceptance, complex motion-based traits derived from mesh-based methods will need to be thoroughly validated against the conventional segmentation-based metrics used in current practice.
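For instance, the cavity volume at each frame can be computed directly from a predicted closed surface mesh (e.g., the endocardial surface) via the divergence theorem, from which the ejection fraction follows. The sketch below illustrates the idea; the function names are ours, not part of the paper's code:

```python
import numpy as np

def mesh_volume(verts, faces):
    """Volume enclosed by a closed triangle mesh (divergence theorem).

    verts: (N, 3) vertex coordinates; faces: (F, 3) vertex indices with a
    consistent orientation. Each face forms a tetrahedron with the origin;
    the signed tetrahedron volumes sum to the enclosed volume.
    """
    v0 = verts[faces[:, 0]]
    v1 = verts[faces[:, 1]]
    v2 = verts[faces[:, 2]]
    signed = np.einsum('ij,ij->i', v0, np.cross(v1, v2))  # 6x signed volumes
    return np.abs(signed.sum()) / 6.0

def ejection_fraction(edv, esv):
    """Ejection fraction (%) from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv - esv) / edv
```

Because the predicted meshes maintain vertex correspondences across frames, such biomarkers can be evaluated at every frame of the cardiac cycle without re-segmentation.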
VI. CONCLUSION
In this paper, we propose a novel deep learning method for template-guided mesh-based cardiac motion tracking. The proposed method reconstructs the 3D heart mesh of the reference frame and estimates a per-vertex motion field from 2D SAX and LAX view CMR images. It enables both mesh reconstruction and mesh motion tracking, and is capable of maintaining the number of vertices and the vertex correspondences across the cardiac cycle. Experimental results demonstrate the effectiveness of the proposed method compared with other competing methods.

This research has been conducted using the UK Biobank Resource under application number 40616. This work is supported by the British Heart Foundation (RG/19/6/34387, RE/18/4/34215); Medical Research Council (MC UP 1605/13); National Institute for Health Research (NIHR) Imperial College Biomedical Research Centre. W. Bai is supported by the EPSRC DeepGeM Grant (EP/W01842X/1); D. Rueckert is supported by the ERC Advanced Grant Deep4MI (884622). (Declan O'Regan and Daniel Rueckert are joint senior authors.) For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.
(a) Mesh reconstruction (optimized by the mesh reconstruction loss). (b) Mesh motion estimation (optimized by the mesh motion estimation loss).

Fig. 2: An overview of the mesh reconstruction module. This module reconstructs the ED frame mesh of individual subjects from a template mesh and multi-view images. In this module, the deformation network (H D (•)) predicts an intermediate voxel-wise displacement Φ tpl→0, and then ∆V tpl→0, containing the per-vertex displacements, is generated by sampling from Φ tpl→0.

Fig. 3: A diagram of the network architecture of (a) the deformation network H D (•) and (b) the motion network H M (•). Here, Conv represents a convolutional layer with ReLU and batch normalization, while deConv represents a transposed convolutional layer with ReLU and batch normalization. The detailed network architecture and code can be found at https://github.com/qmeng99/DeepMesh.

Fig. 4: An overview of the mesh motion estimation module. This module estimates the motion of the heart mesh from the ED frame to the t-th frame. It takes multi-view images of the ED frame and the t-th frame as input and learns a vertex-wise 3D mesh motion field ∆V 0→t via predicting an intermediate voxel-wise motion field Φ 0→t. By updating the myocardial mesh of the ED frame with ∆V 0→t, the mesh of the t-th frame is predicted.

Fig. 5:

Fig. 6: An example of ED frame mesh reconstruction. (a) Left: ground truth mesh (green) of a subject heart vs. the template (blue); right: ground truth mesh (green) vs. the reconstructed mesh (red). (b) 2D contours on the SAX, 2CH and 4CH view planes, generated by rasterizing the reconstructed mesh on the corresponding view planes. Red contours denote predicted results, while green contours denote the ground truth.

Fig. 9: Motion estimation using the baseline methods and the proposed method. The green mesh is the ground truth (GT) mesh of the ES frame. The red meshes are the predicted ES frame meshes from the different methods.

Fig. 10:
Fig. 11: Comparison of the template and a subject ED frame mesh. (a) shows that the template is not in the same space as the subject mesh. (b) demonstrates that we can move the template to the subject space after data pre-processing. Green meshes are the ground truth subject mesh. Blue meshes are the template before and after data pre-processing.

Fig. 12: Qualitative results of mesh reconstruction and mesh motion estimation with different combinations of losses. The top row shows the reconstructed ED frame mesh. The bottom row shows the estimated ES frame mesh.

TABLE I: Mesh reconstruction performance by comparing the predicted and ground truth 2D myocardium contours on different view planes. The results are reported as "mean (standard deviation)".

TABLE II: Comparison with other cardiac motion tracking methods. The results are reported as "mean (standard deviation)". ↑ indicates that higher values are better, while ↓ indicates that lower values are better. Best results in bold.

TABLE III: Mesh reconstruction with different anatomical views and different loss combinations.

TABLE IV: Mesh motion estimation with different anatomical views and different loss combinations.