Communication Efficient Federated Generalized Tensor Factorization for Collaborative Health Data Analytics

Modern healthcare systems, knitted together by a web of entities (e.g., hospitals, clinics, pharmacy companies), are collecting a huge volume of healthcare data from a large number of individuals, covering various medical procedures, medications, diagnoses, and lab tests. To extract meaningful medical concepts (i.e., phenotypes) from such higher-arity relational healthcare data, tensor factorization has proven to be an effective approach and has received increasing research attention, due to its intrinsic capability to represent high-dimensional data. Recently, federated learning has offered a privacy-preserving paradigm for collaborative learning among different entities, which seemingly provides an ideal means to further enhance tensor factorization-based collaborative phenotyping on sensitive personal health data. However, existing attempts at federated tensor factorization come with various limitations, including restriction to the classic tensor factorization, high communication cost, and reduced accuracy. We propose a communication efficient federated generalized tensor factorization, which is flexible enough to choose from a variety of losses to best suit different types of data in practice. We design a three-level communication reduction strategy tailored to the generalized tensor factorization, which reduces the uplink communication cost by up to 99.90%. In addition, we theoretically prove that our algorithm does not compromise convergence speed despite the aggressive communication compression. Extensive experiments on two real-world electronic health record datasets demonstrate the efficiency improvements in terms of computation and communication cost.


INTRODUCTION
Computational phenotyping, i.e., extracting meaningful medical concepts (phenotypes) from the health data, is an indispensable stepping stone towards in-depth medical decision-making, including precision medicine, influenza surveillance, and drug discovery, to name a few. Computational phenotyping is known to be challenging, given that health data are collected from a large number of individuals, with each one's medical record consisting of various medical procedures, medications, diagnoses, and lab tests. That is, the health data is massive and multidimensional. In addition, in order to collaboratively learn phenotypes from data belonging to different institutes (known as collaborative phenotyping), the sensitive nature of the health data imposes an additional restriction.
To learn phenotypes from the multidimensional EHR data, tensor factorization has received increasing interest [12-14, 20, 27, 28, 36]. Tensors have the intrinsic capability to succinctly represent multidimensional data [21] and have applications beyond health data analytics, e.g., recommender systems [18], spatio-temporal data analysis [26], computer vision [35], and signal processing [32]. The CANDECOMP/PARAFAC or canonical polyadic (CP) tensor factorization (TF) [7,11] and its generalization GTF [15] are fundamental tools for analyzing tensors. Despite their effectiveness and wide applications, scalability is often a major issue preventing them from being applied to the larger-scale health datasets that are commonly encountered nowadays. To improve the scalability of TF, distributed tensor factorization (DTF) methods [6,9,12,20,27,31,41] are capable of processing large tensors that cannot be handled by a single machine. DTF also complies with the practical scenario in which health data is collected and held across multiple physically distributed medical institutions.

Contributions
In this paper, we investigate how to reduce the uplink communication cost of federated tensor factorization-based collaborative phenotyping with guaranteed convergence and quality preservation. It is a challenging task, especially considering that the communication efficiency issue is understudied in the broader distributed tensor factorization literature. To be more flexible and suitable for a variety of applications, we consider the federated generalized tensor factorization (FGTF), which greatly extends the existing federated classic TF [20,27].
First, we aim to reduce the uplink communication cost in each communication round. We design a two-level per-round communication reduction strategy: block-level and element-level, which reduce a $1 - \frac{1}{D}$ fraction and over 96.8% of the uplink communication, respectively, where $D$ is the number of blocks. For the block-level, we exploit the multi-factor structure of TF/GTF by utilizing the randomized block update. It enables each client to send only the partial gradient of the sampled block, rather than the full gradient of all blocks. For the element-level, we introduce gradient compression techniques, which have found success in deep learning training [2,4,19,37,42], to compress each element of the communicated partial gradient from a floating point representation to a low-precision representation. Since there exists an error between the true partial gradient and the compressed one, the convergence can be slower and the output quality can be lower. We further introduce the error-feedback mechanism [19], which records this error and feeds it back to restore the shift.
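To make the element-level step concrete, here is a minimal sketch of one compress-and-feedback operation, assuming a scaled-sign compressor (one of several compressors compatible with error-feedback; the function and variable names are illustrative, not the paper's):

```python
import numpy as np

def compress_with_error_feedback(grad, error):
    """One element-level compression step with error-feedback.

    grad:  full-precision partial gradient of the sampled block
    error: residual accumulated from previous rounds (same shape)
    Returns (low-precision message, updated error record).
    """
    shifted = grad + error                  # feed the past error back in
    scale = np.mean(np.abs(shifted))        # one float carries the magnitude
    message = scale * np.sign(shifted)      # ~1 bit/element instead of 32
    new_error = shifted - message           # what compression lost this round
    return message, new_error

# Toy usage: the error record starts at zero and is carried across rounds.
rng = np.random.default_rng(0)
g = rng.normal(size=(6, 4))
e = np.zeros_like(g)
msg, e = compress_with_error_feedback(g, e)
```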
With both levels of per-round communication reduction, we propose the federated GTF with communication compression and error-feedback (FedGTF-EF). We analyze the convergence of FedGTF-EF and obtain an $O(1/\sqrt{T})$ rate after $T$ iterations (Thm. 4.1) under common and mild assumptions (Assumptions 4.1-4.5). The rate matches that of distributed stochastic gradient descent (SGD) with full-precision gradient communication and that of distributed SGD with gradient compression and error-feedback [42]. In addition, since constraints and nonsmooth regularizations are common in GTF, we further extend the convergence result to the proximal setting (Thm. 4.2), where the additional "simple regularizer" condition in Assumption 4.6 is satisfied. Compared to the existing analyses with gradient compression and error-feedback, our convergence analysis accounts for both the block randomized update strategy and the proximal operation.
Second, we reduce the number of communication rounds to further reduce the uplink communication. To do so, we introduce periodic communication [4,23,33] into FedGTF-EF and denote the resulting algorithm FedGTF-EF-PC, in which the clients send their updates to the server after τ > 1 local iterations instead of communicating after every iteration. A key question is whether the periodic communication will slow down the convergence; if so, the number of iterations will increase and the overall number of communications may not be reduced. We analyze the convergence of FedGTF-EF-PC in Thm. 4.3 and show that the convergence rate is preserved.
Third, we evaluate FedGTF-EF and FedGTF-EF-PC on the federated collaborative phenotyping task. We conduct experiments on two real-world EHR datasets, which show that the proposed method can effectively reduce the uplink communication cost (a 99.90% reduction) without compromising convergence or factorization quality.

Notation
The frequently used notation in this paper is summarized in Table 1. We denote an order-$D$ tensor by $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_D}$. A vector obtained by fixing every index of $\mathcal{X}$ except the $d$-th one is called a mode-$d$ fiber of $\mathcal{X}$.
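For example, in NumPy indexing (an illustrative convention, not the paper's notation), a mode-$d$ fiber of an order-3 tensor is obtained by fixing all indices except the $d$-th:

```python
import numpy as np

# Toy order-3 tensor with shape (I1, I2, I3).
X = np.arange(24).reshape(2, 3, 4)

fiber_mode1 = X[:, 1, 2]   # mode-1 fiber: fix i2, i3, vary i1
fiber_mode2 = X[0, :, 3]   # mode-2 fiber: fix i1, i3, vary i2
fiber_mode3 = X[1, 0, :]   # mode-3 fiber: fix i1, i2, vary i3
```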

Generalized Tensor Factorization
As illustrated in Figure 1, let us consider the EHR tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_D}$, which consists of the patient mode ($I_1$), diagnosis mode ($I_2$), medication mode ($I_3$), and so on. The regularized Generalized CANDECOMP/PARAFAC (GTF) [15] extracts the phenotypes by decomposing the EHR tensor into $R$ phenotypes, where each consists of a patient factor, a diagnosis factor, and a medication factor. GTF has the following objective function:

$$\min_{\{A^{(d)}\}} \; f(\mathcal{X}; \mathcal{M}) + \sum_{d=1}^{D} r^{(d)}\big(A^{(d)}\big), \quad \text{s.t. } \mathcal{M} = \sum_{i=1}^{R} A^{(1)}(:,i) \circ \cdots \circ A^{(D)}(:,i),$$

which breaks down into three parts:

1. Factorization constraint: The constraint approximates $\mathcal{X}$ by the low-rank CP tensor $\mathcal{M} \in \mathbb{R}^{I_1 \times \cdots \times I_D}$, expressed as the sum of $R$ rank-one tensors, where $A^{(d)} \in \mathbb{R}^{I_d \times R}$ is the $d$-th factor matrix and $A^{(d)}(:,i)$ is its $i$-th column. For phenotyping, $A^{(1)}$, $A^{(2)}$, and $A^{(3)}$ correspond to the patient factor, diagnosis factor, and medication factor, respectively (see the sketch after this list).

2. Loss function: the loss $f(\mathcal{X}; \mathcal{M})$ can be chosen from a variety of losses (e.g., least squares, Bernoulli logit) to best suit the type of data at hand.

3. Regularization: each factor matrix can carry a possibly nonsmooth regularizer $r^{(d)}(\cdot)$, e.g., the ℓ1-norm for inducing sparsity.
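As a concrete illustration of the factorization constraint (part 1), the following sketch reconstructs a CP tensor from its factor matrices as a sum of $R$ rank-one tensors (Python/NumPy; the toy sizes, rank, and function name are assumptions for illustration):

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild M = sum_i A1(:,i) o A2(:,i) o ... o AD(:,i)."""
    R = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    M = np.zeros(shape)
    for i in range(R):
        rank1 = factors[0][:, i]
        for A in factors[1:]:
            rank1 = np.multiply.outer(rank1, A[:, i])  # outer product (o)
        M += rank1
    return M

# Toy example: patient x diagnosis x medication tensor of rank R = 5.
rng = np.random.default_rng(0)
I, R = (20, 15, 10), 5
A = [rng.random((dim, R)) for dim in I]   # A[0]: patient factor, etc.
M = cp_reconstruct(A)                      # shape (20, 15, 10)
```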

PROPOSED METHODS
Under the federated setting illustrated in Fig. 2, the EHR tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_D}$ is collectively held by $K$ institutions. The $k$-th client's local tensor is denoted by $\mathcal{X}_k$, which contains information about $I_1^k$ individuals, such that $\sum_{k=1}^{K} I_1^k = I_1$. That is, we consider the horizontally partitioned setting where different hospitals share the same feature space. We also note that there are related works addressing other settings, such as vertically partitioned settings [8,24,25,39], which are complementary to our work. The aim of federated computational phenotyping is to collaboratively compute the phenotypes from the EHR tensors across the $K$ institutions without sharing the raw tensors and patient-mode variables. The objective function of the federated GTF is

$$\min_{\{A^{(d)}\}} \; \sum_{k=1}^{K} f_k\big(\mathcal{X}_k; \mathcal{M}_k\big) + \sum_{d=1}^{D} r^{(d)}\big(A^{(d)}\big),$$

where $\mathcal{M}_k$ is the CP model formed from client $k$'s local patient factor and the shared non-patient factors. In fact, the above formulation can be extended to general multi-block problems as well. Thus, our algorithms are not limited to federated GTF problems but also apply to other nonconvex problems possessing a multi-block decision variable structure, e.g., [40]. In the following, we propose the federated generalized tensor factorization with communication efficiency improvements via block randomization, gradient compression, error feedback, and periodic communication. The execution of the proposed algorithm is illustrated in Fig. 3.
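For intuition, here is a minimal sketch of the horizontal (patient-mode) partitioning, assuming an even split of a toy third-order tensor across $K$ clients (the sizes and names are illustrative):

```python
import numpy as np

# Full EHR tensor: I1 patients x I2 diagnoses x I3 medications (toy sizes).
rng = np.random.default_rng(0)
X = rng.random((100, 15, 10))

# Horizontal partition along the patient mode across K clients; each
# client holds I1^k patients but shares the diagnosis/medication modes.
K = 4
local_tensors = np.array_split(X, K, axis=0)
assert sum(Xk.shape[0] for Xk in local_tensors) == X.shape[0]
```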

FedGTF-EF: Communication Efficient GTF with Block Randomization, Gradient Compression and Error-Feedback
We reduce the uplink communication in each communication round at two levels: block-level and element-level. The detailed algorithm is displayed in Algorithm 1, with the functionality of each key step annotated. At the block-level, to avoid sending all factors, we use a randomized block (i.e., randomized factor) update, which only requires communicating the partial gradient of the sampled factor (the computation of the partial gradient is detailed in Sec. 3.3). At the element-level, we compress each element of the communicated message to a low-precision representation before sending it to the server (Line 6). Each client $k$ keeps $D$ local triples of $P_k^{(d)}$ (the error-shifted full-precision partial gradient), $\Delta_k^{(d)}$ (the compressed gradient to be communicated), and $E_k^{(d)}$ (the error record between the full-precision gradient and the compressed gradient), for all $d = 1, \ldots, D$ factors. Depending on whether the regularizer is smooth or not, either a simple gradient descent step (Line 8) or a proximal gradient descent step (Line 9) is chosen to update the sampled factor.

Algorithm 1
FedGTF-EF: Communication Efficient GTF with Block Randomization, Gradient Compression and Error-Feedback
On Each Client Node k ∈ {1, …, K}:
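As a rough companion to the listing, the sketch below mimics one client round of FedGTF-EF, assuming a scaled-sign compressor and a plain gradient step (Line 8); the gradient oracle, names, and stepsize are illustrative stand-ins, not the paper's exact pseudocode:

```python
import numpy as np

def fedgtf_ef_client_round(factors, error, d, partial_grad, gamma=0.1):
    """One client round of FedGTF-EF for the sampled block d.

    factors:      list of factor matrices A^(1..D)
    error:        list of per-block error records E^(1..D)
    d:            index of the randomly sampled block
    partial_grad: oracle returning the stochastic partial gradient G^(d)
    Returns the compressed uplink message Delta^(d).
    """
    G = partial_grad(factors, d)     # Line 4: stochastic partial gradient
    P = gamma * G + error[d]         # error-shifted full-precision update
    scale = np.mean(np.abs(P))
    delta = scale * np.sign(P)       # Line 6: compress before the uplink
    error[d] = P - delta             # error record fed back next round
    return delta

# Toy usage with a stub gradient oracle; the server would average the K
# clients' messages and apply the (proximal) step to the sampled factor.
rng = np.random.default_rng(0)
A = [rng.random((20, 5)), rng.random((15, 5)), rng.random((10, 5))]
E = [np.zeros_like(Ad) for Ad in A]
stub_grad = lambda facs, d: rng.normal(size=facs[d].shape)
msg = fedgtf_ef_client_round(A, E, d=1, partial_grad=stub_grad)
```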

FedGTF-EF-PC: Further Communication Reduction by Periodic Communication
We further reduce the uplink communication cost by introducing a third communication reduction level: the round level. That is, we decrease the communication frequency from one communication per iteration to one communication per τ > 1 iterations, which manifests a periodic communication behavior [4,23,33]. The detailed algorithm is provided in Algorithm 2. The major difference from Algorithm 1 is that each client compresses and sends the collective update accumulated across τ iterations (Lines 9-10), instead of the partial gradient from a single iteration. The error feedback (Line 9) and error memory (Lines 7, 13) are adjusted accordingly.
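A minimal sketch of the round-level reduction under the same illustrative assumptions (scaled-sign compressor, stub gradient oracle): the client runs τ local iterations, then compresses the accumulated update with error feedback before a single uplink message.

```python
import numpy as np

def fedgtf_ef_pc_client_round(factors, error, partial_grad, tau=5,
                              gamma=0.1, rng=None):
    """One FedGTF-EF-PC round: tau local steps, one compressed uplink."""
    rng = rng or np.random.default_rng()
    D = len(factors)
    accum = [np.zeros_like(Ad) for Ad in factors]
    for _ in range(tau):                      # local iterations, no uplink
        d = int(rng.integers(D))              # block-randomized sampling
        G = partial_grad(factors, d)
        factors[d] = factors[d] - gamma * G   # local update of block d
        accum[d] += gamma * G                 # collective update over tau steps
    msgs = []
    for d in range(D):                        # compress accumulated updates
        P = accum[d] + error[d]               # error feedback across rounds
        scale = np.mean(np.abs(P))
        delta = scale * np.sign(P)
        error[d] = P - delta                  # adjusted error memory
        msgs.append(delta)
    return msgs                               # one uplink per tau iterations
```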

Efficient Partial Stochastic Gradient Computation for FedGTF
Having presented the overall algorithms, we now present an efficient subroutine to compute the partial stochastic gradient $G^{(d)}$ in Step 1 of Fig. 3 and Line 4 of Algorithms 1 and 2. The first mode ($I_1$) is the individual mode (e.g., the patient mode), which can be kept local to each client. Thus, when the sampled block is the first mode, i.e., $d_{\xi}[t] = 1$, we skip the communication, which not only further reduces the communication cost but also benefits privacy, since no individual-level information is shared.
Next, we specify the computation of the partial stochastic gradient $G^{(d)}$ based on the efficient fiber sampling technique [5,10]. The deterministic partial gradient takes the form $G^{(d)} = \big(\nabla_{\mathcal{M}} f\big)_{(d)} H_d$, where $H_d$ is the mode-$d$ Khatri-Rao product of all the factor matrices except the $d$-th one. Fiber sampling draws a subset of mode-$d$ fibers of the local tensor together with the corresponding $|\mathcal{F}_d|$ rows of $H_d$, where $\mathcal{F}_d$ denotes the index set of the sampled fibers; the stochastic partial gradient is then formed from the sampled fibers and the rows $H_d(s, :)$, $s \in \mathcal{F}_d$. According to the complexity analysis, our gradient computation in eq. (4) matches the state-of-the-art efficiency of GTF computation, e.g., [10].

Algorithm 2

FedGTF-EF-PC: Further Reducing Communication Cost by Periodic Communication
On Each Client Node k ∈ {1, …, K}: …
On Server Node: aggregate the clients' compressed updates and broadcast the updated factors to all client nodes; end for.
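To illustrate the fiber-sampling computation described above on a third-order tensor, the sketch below computes a mode-1 sampled partial gradient under a least-squares loss (an assumption for simplicity; the paper's eq. (4) covers general losses, and the indexing convention here is NumPy's):

```python
import numpy as np

def sampled_partial_grad_mode1(X, A1, A2, A3, n_samples, rng):
    """Fiber-sampled stochastic partial gradient w.r.t. A1 (least squares).

    Mode-1 fibers are indexed by (i2, i3); each sample needs only one
    fiber of X and one row of the Khatri-Rao product H1 formed from A2, A3.
    """
    I1, I2, I3 = X.shape
    i2 = rng.integers(I2, size=n_samples)
    i3 = rng.integers(I3, size=n_samples)
    H_rows = A2[i2, :] * A3[i3, :]        # sampled rows of H1, shape (S, R)
    fibers = X[:, i2, i3]                  # sampled mode-1 fibers, (I1, S)
    residual = A1 @ H_rows.T - fibers      # model minus data on the samples
    return residual @ H_rows / n_samples   # unbiased up to a constant scale

# Toy usage.
rng = np.random.default_rng(0)
I1, I2, I3, R = 20, 15, 10, 5
X = rng.random((I1, I2, I3))
A1, A2, A3 = (rng.random((n, R)) for n in (I1, I2, I3))
G1 = sampled_partial_grad_mode1(X, A1, A2, A3, n_samples=32, rng=rng)
```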

ALGORITHM ANALYSIS
This section presents the convergence analysis and complexity analysis of FedGTF-EF and FedGTF-EF-PC. A proof sketch of the convergence analysis is provided in the appendix.

Convergence Analysis
Assumptions.: To analyze the convergence, we make the following assumptions, which are common to many machine learning problems [4,10,34,42]. Let ζ[t] denote the randomness of computing the stochastic gradient G, and ξ[t] the randomness of the block sampling.

ASSUMPTION 4.3.: (Bounded Variance) The stochastic gradient has bounded variance: $\mathbb{E}_{\zeta}\big\|G^{(d)} - \nabla^{(d)} f\big\|_F^2 \le \sigma^2$.

Many common regularizers satisfy the simple-regularizer condition of Assumption 4.6; for example, the ℓ1-norm for inducing sparsity, whose proximal operator is the soft-thresholding operator.
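For instance, here is a minimal sketch of the soft-thresholding operator, i.e., the proximal operator of the ℓ1-norm (the threshold value in the example is arbitrary):

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam * ||x||_1: soft-thresholding.

    prox(x) = sign(x) * max(|x| - lam, 0), applied element-wise.
    """
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Example: entries with magnitude below lam are zeroed, inducing sparsity.
x = np.array([-0.3, 0.05, 0.8, -1.2])
print(prox_l1(x, lam=0.1))   # [-0.2  0.   0.7 -1.1]
```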
Thus, the recurrence can be viewed as compressed SGD with error-feedback, with block randomization added. The convergence of Algorithm 1 applied to the smooth regularization case is as follows.
Remark 1.: Under similar assumptions, our convergence rate matches the rates of distributed synchronous SGD and of distributed SGD with gradient compression and error-feedback [42]. Thus, we can further reduce the computation and uplink communication from a full-length gradient update and communication [4,42] to a single randomized block of the partial gradient, without slowing down the convergence rate.

Nonsmooth regularization case.:
This case corresponds to the execution of Line 9 in Algorithm 1. An appropriate optimality condition is based on the generalized gradient measure [10,29,30,38]:

$$\mathcal{G}\big(A[t]\big) = \frac{1}{\gamma}\Big(A[t] - \operatorname{prox}_{\gamma r}\big(A[t] - \gamma \nabla F(A[t])\big)\Big).$$

The following theorem shows the convergence of Algorithm 1 for the nonsmooth regularization case.

Remark 2.:
In the nonsmooth regularization case, the above convergence result is weaker than that of the smooth case: we only ensure that the gap between the initial loss and the optimal value shrinks, while the generalized gradient is not guaranteed to approach 0, given that the variance- and gradient-norm-related terms dominate as T increases. However, our empirical results show that the algorithm is able to converge to small losses.

Convergence Analysis of Algorithm 2.
Now, we provide the convergence rate of Algorithm 2 by extending the proof in [4] to the block randomized setting; the result is obtained under the same assumptions as Theorem 4.1.

The main idea of the analysis is to introduce a virtual sequence of iterates and build an iterative descent relation for it. Meanwhile, we keep track of the error between the true and virtual averages of the iterates, $A_{\mathrm{avg}}[t]$, and the deviation between the local variables and the true average.

Complexity Analysis
We provide the computation, storage, and communication complexities of FedGTF-EF and FedGTF-EF-PC, given that $|\mathcal{F}|$ fibers are sampled by each client and the rank of the GTF is $R$.
Computational Complexity.—Our method is very efficient compared with: 1) the classic CP-ALS and the full gradient descent-based GTF, which must process the entire tensor in each iteration; 2) the sampled randomized CP-ALS in [5] and the SGD-based GTF in [15] with the same number of sampled elements, which cost $O\big(R|\mathcal{F}|\sum_{d=1}^{D} I_d\big)$; and 3) the full-precision block-randomized SGD-based TF [10], whose complexity our method matches.

Storage Complexity.—The fiber sampling-based stochastic partial gradient avoids forming the whole element-wise partial gradient tensor, which reduces the storage for this term, thus achieving the same cost efficiency as sampling-based randomized CP-ALS [5], full-precision SGD [15], and block-randomized full-precision SGD [10].

Experimental Setup
Datasets.—We consider two real-world EHR datasets, as well as a synthetic dataset, which are introduced below.

CONCLUSION

Our convergence analysis relies on common assumptions that apply not only to generalized tensor factorization problems but also to more general machine learning problems possessing a multi-block structure. Our algorithm maintains low computational and storage complexity while incurring much lower uplink communication cost. We demonstrate its superior efficiency and uncompromised quality on a synthetic dataset and two real-world EHR datasets.

ACKNOWLEDGMENTS
We sincerely thank all anonymous reviewers for their valuable comments. This work was supported by National Science Foundation awards IIS-1838200 and CNS-1952192, and National Institutes of Health (NIH) awards R01GM118609, 5K01LM012924, and CTSA UL1TR002378.

A.1 Parameter Settings
For the MIMIC-III, CMS, and synthetic datasets, each algorithm is run for 500 iterations per epoch until convergence, while for the delicious dataset, each algorithm is run for 1000 iterations per epoch. For the GCP algorithm, we tune the stepsize within the range {10^−8, 10^−9, 10^−10, 10^−11}, while for the remaining algorithms, we tune the stepsize by grid search over {2^2, 2^1, 2^0, 2^−1, 2^−2, ..., 2^−11}. The parameter of the proximal operator is set to 10^−4 for all algorithms with proximal operators (FedGTF-EF-prox, DPFact-prox). For all federated algorithms, we by default horizontally partition the tensor (along the I_1 mode) into 8 non-overlapping tensors and distribute them to 8 client nodes. We also test different numbers of workers (16 and 32), with the stepsizes set the same as for 8 workers. The best stepsizes for each algorithm on each dataset are listed in Tables 4 and 5.
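For reference, the tuning grids described above can be written out explicitly (a trivial sketch; the variable names are illustrative):

```python
# Stepsize grids from the settings above.
gcp_grid = [10.0 ** (-p) for p in range(8, 12)]     # 1e-8, ..., 1e-11 (GCP)
other_grid = [2.0 ** p for p in range(2, -12, -1)]  # 2^2, 2^1, ..., 2^-11
prox_param = 1e-4  # proximal parameter for FedGTF-EF-prox and DPFact-prox
```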
Each experiment is averaged over 5 repetitions. All experiments are run in Matlab 2019a on an r5.12xlarge AWS EC2 instance with Tensor Toolbox version 3.1 [3].

A.2 Additional Experiments
Two additional groups of figures are presented here. Fig. 8 shows the decrease of both the Bernoulli loss and the least square loss with respect to time and communication for the synthetic data. Fig. 9 shows the decrease of the Bernoulli loss and the least square loss with respect to epochs, complementing the figures in the main paper, which are plotted with respect to time and communication. Similar conclusions can be drawn as with the real-world EHR datasets in the main paper. That is, the proposed algorithms achieve more efficient convergence than the centralized baselines under the Bernoulli logit loss and than the distributed baseline under the least square loss. They are also more communication-efficient than the algorithms without the gradient compressor (the distributed version of BrasCPD) and without the block randomized mechanism (DPFact and its variants).

Table: Best stepsizes for the least square loss.

B.1 Proof Sketch of Theorem 4.1

B.1.1 Auxiliary Variables.: The following auxiliary variables and virtual iterations are introduced only for the proof:

B.1.2 Additional Lemma.
The following lemma extends Lemma 3 in [19] to our block randomized case.

B.1.3 Main proof sketch of Theorem 4.1.
By the block-wise Lipschitz smoothness assumption of the loss function:

By Assumption 4.2 on ζ[t], taking conditional expectation on both sides of eq. (B.1.3) with respect to the filtration ℱ[t] and the randomness of ζ[t] during the stochastic gradient computation, and plugging in eq. (7), we obtain an intermediate bound. Taking expectation with respect to ξ[t] conditioned on ℱ[t], substituting, applying Lemma B.1, and letting γ[t] = t, we obtain a further bound. Taking total expectation with respect to all the random variables in ℱ[t], averaging from t = 0 to T, and letting ρ < 2 − Lγ, with F* denoting the optimal value, we arrive at the final bound. Setting ρ = 1 and using the block-sampling factor 1/D, for some ϱ > 0, we complete the proof of Theorem 4.1.

B.2 Proof Sketch of Theorem 4.2

B.2.1 Auxiliary Variables for the Proof and Iterative Relation.
We derive the convergence by regarding the iteration as using an inexact gradient, which differs from the approach used for the smooth case, where the iteration is regarded as using a delayed variable. We define the generalized gradient:

B.2.3 Main Proof sketch of Theorem 4.2.
By the block-wise smoothness of F, the convexity of r^(d)(⋅), and the optimality condition of the proximal update, combined with Lemma B.2, we obtain a descent inequality. By bounding the third row of this inequality, choosing ρ_1 = 2γ[t] and ρ_2 = 2, invoking eq. (11), and choosing the stepsize accordingly, we obtain a refined bound. Taking conditional expectation with respect to ξ[t] conditioned on the filtration ℱ[t], applying Lemma B.1, and letting γ[t] = t, we obtain a further bound. Taking total expectation (i.e., with respect to all random variables in ℱ[t]) and averaging from t = 0 to T completes the proof.

Table: Comparison of algorithms in ablation study.