A Regularized Cox Hierarchical Model for Incorporating Annotation Information in Predictive Omic Studies

Abstract Background High-dimensional omics data are often accompanied by "meta-features" such as biological pathways, functional annotations, and summary statistics from similar studies that can be informative for predicting an outcome of interest. We introduce a regularized hierarchical framework for integrating meta-features, with the goal of improving prediction and feature selection performance with time-to-event outcomes. Methods A hierarchical framework is deployed to incorporate meta-features. Regularization is applied to the omic features as well as the meta-features, so that high-dimensional data can be handled at both levels. The proposed hierarchical Cox model can be fitted efficiently by a combination of iterative reweighted least squares and cyclic coordinate descent. Results In a simulation study, we show that when the external meta-features are informative, the regularized hierarchical model can substantially improve prediction performance over standard regularized Cox regression. We illustrate the proposed model with applications to breast cancer and melanoma survival based on gene expression profiles, which show both the improvement in prediction performance from incorporating meta-features and the discovery of important omic feature sets through sparse regularization at the meta-feature level. Conclusions The proposed hierarchical regularized regression model enables integration of external meta-feature information directly into the modeling process for time-to-event outcomes and improves prediction performance when the external meta-feature data are informative. Importantly, when the external meta-features are uninformative, the prediction performance of the regularized hierarchical model is on par with standard regularized Cox regression, indicating robustness of the framework.
In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than developing a predictive model.


Background
Prediction based on high-dimensional omics data such as gene expression, methylation, and genotypes is important for developing better prognostic and diagnostic signatures of health outcomes. However, developing prediction models with high-dimensional omics data, where the number of features is often orders of magnitude larger than the available number of subjects, is challenging. Sparse regularized regression, which includes the Lasso (1) and its variants, the elastic net (2), adaptive Lasso (3), group Lasso (4), and others, is a widely used approach for developing predictive models with high-dimensional data. Sparse regularized regression controls model complexity via sparsity-inducing penalties, which have the effect of shrinking the regression coefficient estimates toward zero and setting some coefficients exactly to zero, effectively selecting features predictive of the outcome.
Associated with high-dimensional omics data, there is often prior knowledge relating the omic features to the outcome of interest, i.e., meta-features. For example, to predict survival based on gene expression profiles, relevant information may be the grouping of genes into biological pathways. Gene grouping information can be encoded by an indicator meta-feature matrix in which each row represents a gene and each column represents a gene group, with a 1 indicating that the gene belongs to the group and a 0 otherwise. This type of prior knowledge provides information on the functions of genes and is available in gene annotation resources. Integrating such annotations can give the omic features in these groups extra importance, e.g., via higher weights for features in the groups or shrinkage of weights for features outside them, thus improving the precision of model parameter estimation. Another important type of meta-feature is summary statistics from similar studies on identical omic features. Here, the omic features could be gene expression values, and the meta-features could be regression coefficients of single nucleotide polymorphisms (SNPs) from those studies. They fit into the same meta-feature matrix used to encode gene annotations: each row represents one gene expression feature, each column represents one SNP, and the values are the regression coefficients of each SNP associated with each gene. This setting is similar to transcriptome-wide association studies (TWAS), in which we investigate the effect of genetic variants on the outcome of interest through their regulation of expression levels. A simpler version of summary statistics is p-values, linkage disequilibrium (LD) scores, or regression coefficients from studies on the same outcome and identical omic features. The meta-feature matrix in this case has only a few columns, e.g., p-values and LD scores, with each row representing the respective summary statistics for a particular omic feature. This type of meta-feature is also referred to as co-data
in some studies (5,6). A third type of meta-feature is multi-omics data. In studies where data such as SNPs, gene expression, protein levels, and DNA methylation are available, it can be insightful to integrate them all in one model. In this case, the meta-feature matrix is also an indicator matrix: each row represents one omic feature, each column represents one omic type (e.g., SNPs, protein levels), and a 1 indicates that the feature belongs to that omic data type. As more omics data and resources become available, more types of data will fit into the meta-feature framework. Kawaguchi et al. (2022) (7) showed in the linear regression setting that integrating additional prior information into regularized regression can yield improved prediction of an outcome of interest based on high-dimensional omic features. They developed a regularized hierarchical regression framework that incorporates external meta-feature information directly into the predictive analysis with omic data. The approach is implemented in the R package xrnet. However, their method can only handle quantitative and binary outcomes, and the penalty types they can use are restricted to ridge, so feature selection is not available in their model. Since survival prediction is the main goal in many prognostic applications, we introduce a regularized hierarchical model, building on Kawaguchi et al. and Weaver et al. (8), that can handle time-to-event outcomes and that can also perform meta-feature selection through the inclusion of a Lasso or elastic net regularization penalty.
There are many approaches for assessing the importance of meta-features after an analysis relating genomic features to an outcome of interest has been performed. For example, gene set enrichment analysis (GSEA) (9)(10)(11) is performed after differential expression analysis to evaluate whether sets of related genes, such as those in the same biological pathway, are over-represented. However, there are few approaches capable of incorporating meta-features directly into the modeling process. Approaches to incorporating meta-features a priori include the application of differential penalization based on external information and two-stage regression methods, where the outcome is first regressed on the genomic features and the resulting effect estimates are in turn regressed on the external meta-features. Tai and Pan (2007) (12) grouped genes based on existing biological knowledge and applied group-specific penalties to nearest shrunken centroids and penalized partial least squares. Bergersen et al. (2011) (13) incorporated external meta-feature information by weighting the Lasso penalty of each genomic feature with a function of the meta-features. Zeng et al.
(2020) (14), on the other hand, incorporated external meta-features to customize the penalty of each feature with a different function of the meta-features. These three methods share the idea of no longer assuming that every genomic feature is equally important, instead assigning differential importance based on external information. However, they cannot handle large numbers of meta-features. Chen and Witte (2007) (15) applied the idea of hierarchical modeling in a Bayesian framework, where the second-stage linear regression serves as a normal prior distribution, the first-stage regression is a normal conditional distribution, and the first-stage regression coefficients are estimated with a posterior estimator. This method does not apply to high-dimensional data, since it is standard regression with no regularization. The data integration methods above improve prediction compared to modeling with genomic features only. However, none of them can handle time-to-event outcomes.
In this paper, we introduce a regularized Cox proportional hazards hierarchical model to integrate meta-features. We will see that when external meta-features are informative, regularized hierarchical modeling improves prediction performance considerably. On the other hand, we also show that when the external meta-features are not informative, it does not perform worse than the standard regularized model that does not use any external information. This shows that the model is robust to the informativeness of the meta-features and can be safely used when the meta-feature informativeness is a priori unknown, as is typically the case. The model can be fitted efficiently using a combination of iterative reweighted least squares and cyclic coordinate descent, as proposed for Lasso Cox regression by Simon et al. (16).

Setup and notations
We assume a survival analysis setting with outcomes from $n$ subjects, $(y, \delta) = (y_1, \ldots, y_n, \delta_1, \ldots, \delta_n)$, where $\delta = (\delta_1, \ldots, \delta_n)$ is a vector of censoring statuses: $\delta_i = 1$ indicates that the event occurred and $\delta_i = 0$ indicates right-censoring. $y = (y_1, \ldots, y_n)$ is the vector of observed times: if $\delta_i = 1$, $y_i$ is the event time, and if $\delta_i = 0$, $y_i$ is the censoring time. Let $X$ denote the $n \times p$ design matrix, where $p$ is the number of features; each row represents the observations on one subject, and each column represents the values of one feature across the $n$ subjects. We are particularly interested in the high-dimensional setting $p \gg n$, where the number of features is larger than the sample size. The goal is to develop a predictive model for the outcome $(y, \delta)$ based on the data $X$.
In a genomics context, the time-to-event outcome $(y, \delta)$ could be event-free survival time, time to disease relapse, or time to death. The design matrix $X$ could contain genotypes, gene expression, or DNA methylation values. For example, in the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) data (section 3.2.1), the outcome $(y, \delta)$ is breast cancer-specific survival, and the data matrix $X$ contains gene expression values with dimension $n \times p$.
Associated with each feature there is typically a set of meta-feature annotations. If $X$ consists of gene expression values, pathway gene sets could be meta-features indicating the sets of genes involved. As for the METABRIC example, four meta-features are believed to be associated with breast cancer: genes with mitotic chromosomal instability (CIN), mesenchymal transition (MES), lymphocyte-specific immune recruitment (LYM), and the FGD3-SUSD3 genes. Each meta-feature consists of a vector of indicator variables for whether a gene belongs to the functional gene group. The genomic meta-features can be collected into a matrix $Z$ of dimensions $p \times q$, where $q$ is the number of meta-features. We propose the regularized hierarchical regression

$$\min_{\beta,\,\alpha}\; L(\beta) + \frac{\lambda_1}{2}\,\|\beta - Z\alpha\|_2^2 + \lambda_2\,\|\alpha\|_1 \qquad [1]$$

for integrating the external meta-feature information in $Z$ into the prediction of time-to-event outcomes based on the features in $X$. Here $L(\beta)$ is the negative log of the Cox partial likelihood, $\beta$ is a length-$p$ vector of regression coefficients corresponding to the features in $X$, and $\alpha$ is a length-$q$ vector of regression coefficients corresponding to the meta-features in $Z$. The objective function [1] can be viewed as arising from a hierarchical model. In the first level of the hierarchy, the partial likelihood term $L(\beta)$ in [1] corresponds to the time-to-event outcome modeled as a function of $X$ via a Cox proportional hazards regression model. In the second level, the $\ell_2$ penalty term $\|\beta - Z\alpha\|_2^2$ corresponds to a linear regression of the estimate of $\beta$ on the meta-feature information $Z$. It can also be thought of as an $\ell_2$ regularization term that shrinks the estimate of $\beta$ toward $Z\alpha$ rather than the usual shrinkage toward zero. In the third level of the hierarchy, the term $\|\alpha\|_1$ is an $\ell_1$ regularization penalty on the vector of estimated meta-feature effects $\hat\alpha$. It enables the selection of important meta-features by shrinking many of its components to 0.
The hyperparameters $\lambda_1, \lambda_2 \ge 0$ control the degree of shrinkage applied to each of the penalty terms and can be tuned by cross-validation. Finally, note that when $\alpha = 0$, the objective function [1] reduces to a standard $\ell_2$-regularized Cox regression.
The partial likelihood function $L(\beta)$ in [1] is based on the Breslow approximation (17) to the Cox partial likelihood. Letting $t_1 < t_2 < \cdots < t_m$ be the unique event times arranged in increasing order, the Cox model assumes proportional hazards:

$$h(t, x_i) = h_0(t)\exp(x_i^T\beta), \qquad [2]$$

where $h(t, x_i)$ is the hazard rate for subject $i$ with feature values $x_i$ at time $t$, and $h_0(t)$ is the baseline hazard rate at time $t$, regardless of the feature values. The Cox partial likelihood (18) can then be written as

$$L_P(\beta) = \prod_{j=1}^{m} \frac{\exp\!\big(x_{(j)}^T\beta\big)}{\sum_{k \in R_j} \exp\!\big(x_k^T\beta\big)}, \qquad [3]$$

where $x_{(j)}$ denotes the feature values of the subject with event time $t_j$, and $R_j = \{k : y_k \ge t_j\}$ is the risk set at time $t_j$, i.e., the set of all subjects who have not yet experienced the event and are uncensored just prior to $t_j$. The partial likelihood allows estimation of $\beta$ without explicitly modeling the baseline hazard $h_0$, and it depends only on the order in which events occur, not on the exact times of occurrence. However, the partial likelihood assumes that event times are unique. To handle ties, where multiple individuals experience the event at the same time, we use the Breslow approximation of the partial likelihood in [3],

$$L_B(\beta) = \prod_{j=1}^{m} \frac{\exp\!\left(\sum_{i \in D_j} x_i^T\beta\right)}{\left(\sum_{k \in R_j} \exp\!\big(x_k^T\beta\big)\right)^{d_j}}, \qquad [4]$$

where $D_j = \{i : \delta_i = 1, y_i = t_j\}$ is the set of individuals with event time $t_j$, and $d_j = |D_j|$ is the number of events at time $t_j$. Breslow's likelihood automatically reduces to the partial likelihood when there are no ties.
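To make the likelihood concrete, the following is a minimal sketch (an illustration, not the fitting code from our implementation; the function name is ours) of the negative log of the Breslow approximation, computed directly from linear predictors, observed times, and event indicators:

```python
import numpy as np

def breslow_neg_log_likelihood(eta, time, status):
    """Negative log of the Breslow approximation to the Cox partial likelihood.

    eta    : (n,) linear predictors x_i' beta
    time   : (n,) observed times y_i
    status : (n,) event indicators delta_i (1 = event, 0 = censored)
    """
    nll = 0.0
    for t in np.unique(time[status == 1]):    # unique event times t_j
        events = (status == 1) & (time == t)  # D_j: events at t_j
        at_risk = time >= t                   # R_j: risk set at t_j
        d_j = events.sum()                    # number of tied events at t_j
        nll -= eta[events].sum() - d_j * np.log(np.exp(eta[at_risk]).sum())
    return nll
```

With no ties this reduces to the usual Cox partial likelihood; the direct double pass over event times makes it O(mn), which is fine for illustration but not for fitting.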

Computations
The objective function [1] can be minimized efficiently using iterative reweighted least squares combined with coordinate descent (16). If the current estimates of the regression coefficients are $(\hat\beta, \hat\alpha)$, we form a quadratic approximation to the negative log-partial likelihood by a Taylor expansion around the current estimates. The approximated objective function has the form

$$Q(\beta, \alpha) = \frac{1}{2}\,(y' - X\beta)^T W (y' - X\beta) + \frac{\lambda_1}{2}\|\beta - Z\alpha\|_2^2 + \lambda_2\|\alpha\|_1, \qquad [5]$$

where

$$y' = X\hat\beta + W^{-1}\big(\delta - \hat\mu\big), \qquad \hat\mu = \operatorname{diag}\!\big(e^{\hat\eta}\big)\, P\, \hat h_0, \qquad [6]$$

is the vector of working responses and $W$ is the Hessian of the negative log-partial likelihood at the current estimates,

$$W = \nabla^2 L(\hat\eta). \qquad [7]$$

In [6] and [7], $\operatorname{diag}(v)$ denotes a diagonal matrix with vector $v$ on its diagonal, and $P$ is an $n \times m$ indicator matrix with $(i,j)$th element $1(y_i \ge t_j)$. Also, $\hat\eta = X\hat\beta$ is the linear predictor; $\hat h_0(t_j) = d_j / \sum_{k \in R_j} e^{\hat\eta_k}$ is the estimated baseline hazard rate at event time $t_j$, collected into the vector $\hat h_0$; and $\hat H_0(y_i) = \sum_{j:\, t_j \le y_i} \hat h_0(t_j)$ is the estimated cumulative baseline hazard at time $y_i$, so that $\hat\mu_i = e^{\hat\eta_i}\hat H_0(y_i)$. In the first part of the quadratic approximation [5], $(y' - X\beta)^T W (y' - X\beta)$ can be viewed as a weighted least squares criterion with $y'$ as the working responses and $W$ as the weights. The weight matrix $W$ is usually diagonal; in the Cox proportional hazards model, however, $W$ is a full symmetric matrix, which leads to computational difficulty, as it requires the calculation of $n^2$ entries. Following Simon et al. (16), only the diagonal entries of $W$ are needed for computation without much loss of accuracy, thereby speeding up the implementation. The $i$th diagonal element of $W$ has the form

$$w_i = \sum_{j \in C_i} \left[\frac{d_j\, e^{\hat\eta_i}}{\sum_{k \in R_j} e^{\hat\eta_k}} - \frac{d_j\, e^{2\hat\eta_i}}{\big(\sum_{k \in R_j} e^{\hat\eta_k}\big)^2}\right], \qquad [8]$$

where $C_i$ is the set of unique event times $t_j$ such that $t_j \le y_i$ (the times at which observation $i$ is still at risk). In computing the weights $w_i$, one bottleneck is that for each $t_j$ in $C_i$ we need the sum $\sum_{k \in R_j} e^{\hat\eta_k}$. Both $C_i$ and $R_j$ can have up to $n$ elements, so naive weight computation has complexity $O(n^2)$. However, if the $y_i$'s are sorted in non-decreasing order, the weight computation can be reduced to linear complexity. Details are in Appendix A.1. Now, letting $\gamma = \beta - Z\alpha$ and using only the diagonal elements of $W$, the quadratic approximation [5] can be re-written as

$$Q(\gamma, \alpha) = \frac{1}{2}\sum_{i=1}^{n} w_i\,\big(y_i' - x_i^T(\gamma + Z\alpha)\big)^2 + \frac{\lambda_1}{2}\|\gamma\|_2^2 + \lambda_2\|\alpha\|_1. \qquad [9]$$

This reduces the problem to repeatedly solving a regularized, weighted least squares problem using cyclic coordinate descent (19). Details are given in Appendix A.2.
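The linear-time idea behind the weight computation (Appendix A.1 is not reproduced here) can be sketched as follows: after sorting subjects by observed time, every risk-set sum $\sum_{k \in R_j} e^{\hat\eta_k}$ is a suffix sum of the sorted $e^{\hat\eta}$ values, and the per-subject accumulations of $d_j/S_j$ and $d_j/S_j^2$ can be carried along in a single pass. This is an illustrative reimplementation, not the package code:

```python
import numpy as np

def cox_diag_weights(eta, time, status):
    """Diagonal IRLS weights for the Cox model in a single pass.

    After sorting by time, the risk-set sums S_j = sum_{k in R_j} exp(eta_k)
    are suffix sums of exp(eta); carrying along a = sum_j d_j / S_j and
    b = sum_j d_j / S_j**2 over event times t_j <= y_i gives
    w_i = exp(eta_i) * a - exp(2 * eta_i) * b.
    """
    order = np.argsort(time)                  # non-decreasing times
    t, d, e = time[order], status[order], np.exp(eta[order])
    suffix = np.cumsum(e[::-1])[::-1]         # suffix[i] = sum of e[i:]
    n = len(t)
    w_sorted = np.zeros(n)
    a = b = 0.0
    i = 0
    while i < n:
        j = i                                 # find the tie block [i, j)
        d_j = 0
        while j < n and t[j] == t[i]:
            d_j += int(d[j])
            j += 1
        if d_j > 0:                           # this time is an event time
            S = suffix[i]                     # risk set = everyone from i on
            a += d_j / S
            b += d_j / S ** 2
        w_sorted[i:j] = e[i:j] * a - e[i:j] ** 2 * b
        i = j
    w = np.zeros(n)
    w[order] = w_sorted
    return w
```

The cost is dominated by the sort, so the weights are obtained in O(n log n) rather than O(n^2).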
The model learning process of regularized regression is controlled by the trade-off between shrinkage of regression coefficients toward 0 (bias) and model complexity (variance): the more the regression coefficients are shrunk toward 0, the less complex the model, and vice versa. Examining equation [9] further, the regression coefficients of the omic features, $\beta$, are now represented as the first-level coefficients $\gamma$ plus the second-level information $Z\alpha$, i.e., $\beta = \gamma + Z\alpha$, and $\beta$ is penalized separately through $\gamma$ and $\alpha$. Taking gene functional groups as an example for the meta-feature matrix $Z$: if a functional group is highly associated with the outcome of interest, the omic features in that group are given extra importance through $Z\alpha$, so the coefficients of those features are biased less toward 0 (closer to unbiased maximum likelihood estimators), at the cost of added model complexity in $\alpha$. Provided the meta-features are highly informative, so that the coefficients of unrelated functional groups, a subset of $\alpha$, are shrunk to small values or to 0 by the $\ell_1$ norm, the gain in bias reduction will outweigh the added model complexity.
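The substitution $\beta = \gamma + Z\alpha$ also has a convenient computational consequence: since $X\beta = X\gamma + (XZ)\alpha$, the hierarchical problem is equivalent to a standard regularized fit on the augmented design $[X \mid XZ]$, with an $\ell_2$ penalty on $\gamma$ and an $\ell_1$ penalty on $\alpha$. A minimal sketch of this reparameterization (whether an implementation exploits exactly this form is an internal detail):

```python
import numpy as np

def augmented_design(X, Z):
    """Since X @ beta = X @ gamma + (X @ Z) @ alpha when beta = gamma + Z alpha,
    the hierarchical problem can be fit on the augmented design [X | XZ],
    with an l2 penalty on the gamma block and an l1 penalty on the alpha block.
    """
    return np.hstack([X, X @ Z])
```

Fitting on the augmented design with block-specific penalties then recovers $\hat\beta = \hat\gamma + Z\hat\alpha$.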

Two-dimensional hyperparameter tuning
The optimization approach described above fits the model for a single combination of the tuning parameters $(\lambda_1, \lambda_2)$. To obtain solutions over a two-dimensional grid of tuning parameter values, a pathwise strategy (20) is applied along the two-dimensional path. The pathwise algorithm uses the current estimates as warm starts; since the solutions to the convex problem [9] are continuous in the tuning parameters, this makes the algorithm remarkably efficient and stable.
A large value of the two-dimensional hyperparameter $(\lambda_1, \lambda_2)$ gives rise to solutions $\hat\beta$ of small magnitude. This makes convergence fast, as the initial value $\hat\beta = 0$ is not far from the solution. Applying this logic, we gradually decrease the value of $\lambda_1$ and initialize the model parameters at the solutions for the previous $\lambda_1$, which is called a warm start, until arriving at a nearly unregularized solution. $\lambda_2$, the hyperparameter controlling the amount of shrinkage applied by the $\ell_1$ term $\|\alpha\|_1$, is treated the same way.
The initial value of $\lambda_2$, i.e., $\lambda_2^{(max)}$, is the smallest $\lambda_2$ value that shrinks the entire vector $\hat\alpha$ to 0. For the two-dimensional hyperparameter grid (Figure 1), we start with $(\lambda_1^{(max)}, \lambda_2^{(max)})$, select $\lambda^{(min)} = 0.01 \times \lambda^{(max)}$, and construct a sequence of 20 $\lambda$ values from $\lambda^{(max)}$ to $\lambda^{(min)}$ on the log scale in each dimension, forming a $20 \times 20$ grid with $\lambda_1$ and $\lambda_2$ on either dimension.
To apply warm starts, one of the hyperparameters, $\lambda_2$, is held fixed while $\lambda_1$ is decreased along its sequence until reaching $\lambda_1^{(min)}$. This procedure is then repeated at $\lambda_2^{(k+1)}$, the next value along the $\lambda_2$ sequence, with warm starts at the solutions obtained at $\lambda_2^{(k)}$. The entire $20 \times 20$ grid is walked through in this way, ending at $(\lambda_1^{(min)}, \lambda_2^{(min)})$.
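The grid traversal with warm starts can be sketched as follows; `fit` is a placeholder for a single-point $(\lambda_1, \lambda_2)$ solver and is a hypothetical interface, not the package API:

```python
import numpy as np

def lambda_path(lam_max, n_lam=20, ratio=0.01):
    """Decreasing sequence of n_lam penalties from lam_max down to
    ratio * lam_max, equally spaced on the log scale."""
    return np.exp(np.linspace(np.log(lam_max), np.log(ratio * lam_max), n_lam))

def grid_search(fit, lam1_max, lam2_max):
    """Walk the 20 x 20 (lambda1, lambda2) grid with warm starts.

    fit(lam1, lam2, beta0, alpha0) -> (beta, alpha) is a caller-supplied
    single-point solver (a hypothetical interface). For each lambda2 value,
    lambda1 is decreased along its path; the solution at the start of a row
    seeds the next row.
    """
    lam1s, lam2s = lambda_path(lam1_max), lambda_path(lam2_max)
    solutions = {}
    row_start = (None, None)              # cold start at the grid corner
    for lam2 in lam2s:
        beta, alpha = row_start
        for i, lam1 in enumerate(lam1s):
            beta, alpha = fit(lam1, lam2, beta, alpha)   # warm start
            if i == 0:
                row_start = (beta, alpha)  # seed for the next lambda2 row
            solutions[(lam1, lam2)] = (beta, alpha)
    return solutions
```

Each of the 400 fits starts close to a neighboring solution, which is what makes the full grid affordable.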

Algorithm
The computational procedure to fit the regularized hierarchical Cox model can be summarized as:
• Initialize $\beta$ and $\alpha$ with warm starts $\hat\beta$ and $\hat\alpha$.
• Repeat until convergence of $(\hat\beta, \hat\alpha)$:
• Compute the working responses $y'$ and the diagonal weights $w$ at the current estimates.
• Minimize the penalized weighted least squares objective [9] with respect to $(\gamma, \alpha)$ by cyclic coordinate descent, and set $\hat\beta = \hat\gamma + Z\hat\alpha$.

Simulation Design
A simulation study was performed to evaluate the predictive performance of the hierarchical Cox regression model compared to standard penalized Cox regression. The main parameters controlled include the informativeness of the meta-features, the sample size, the number of features, and the number of meta-features. The $p \times q$ meta-feature matrix $Z$ is generated with each element drawn from an independent Bernoulli variable with probability 0.1. This mimics binary indicators for whether a gene belongs to a particular biological pathway.
The first-level regression coefficients are generated as $\beta = Z\alpha + \epsilon$, where $\epsilon \sim N(0, \sigma^2 I)$. To control the predictive power of the meta-features, the signal-to-noise ratio $SNR = \operatorname{Var}(Z\alpha)/\sigma^2$ is set, where the signal is the variance of $\beta$ explained by $Z$ and $\sigma^2$ is the noise variance. A higher signal-to-noise ratio implies a higher level of informativeness of the meta-features with respect to the coefficients $\beta$. The data matrix $X$ is generated by sampling from a multivariate normal distribution $N(0, \Sigma)$, where the covariance matrix $\Sigma$ has an autoregressive correlation structure $\Sigma_{ij} = \rho^{|i-j|}$ for $i, j = 1, \ldots, p$. To control the predictivity of the features $X$ for the outcome, Harrell's concordance index (C-index) (22) is used as the metric to evaluate prediction performance. It is defined as the probability that a randomly selected patient who experienced an event has a higher risk score, $x_i^T\beta$, than a patient who has not experienced an event at a given time. The C-index is an analog of the area under the ROC curve for time-to-event data. The higher the C-index, the better the model can discriminate between subjects who experience the outcome of interest and subjects who do not or have not yet. Random noise is added to the survival times to control the C-index, with the noise distributed as a normal with mean zero and a variance set to yield a C-index of 0.8 across all simulation scenarios. This is the population/theoretical C-index of the generated survival data, achievable if $\beta$ were known or if one had an infinite sample size. When $\beta$ is estimated from a finite training set, the achieved model C-index will be lower.
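For reference, Harrell's C-index can be computed by a direct O(n²) pair count; this sketch handles right-censoring by treating only pairs in which the earlier time is an observed event as comparable, and splits ties in risk score:

```python
import numpy as np

def c_index(risk, time, status):
    """Harrell's concordance index.

    A pair (i, j) is comparable when y_i < y_j and delta_i = 1 (the earlier
    time is an observed event). The pair is concordant when the subject with
    the earlier event has the higher risk score; ties in risk count 1/2.
    """
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if status[i] != 1:
            continue
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable
```

A C-index of 1 means perfect discrimination, 0.5 is no better than random, and the 0.8 used in the simulations sits between the two.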
The base-case scenario is simulated with sample size $n = 100$, number of features $p = 200$, and number of meta-features $q = 50$. This is a high-dimensional setting, $p \gg n$, typical of genomic studies. The first 20% of the coordinates of the meta-feature level coefficients $\alpha$ are set to 0.2, and the rest are set to 0. In the base scenario, the meta-features are highly informative, with a signal-to-noise ratio of 2. The covariance matrix $\Sigma$ of $X$ has an autoregressive-1 structure with parameter $\rho = 0.5$. In the simulation experiments that follow, one of the parameters is varied in each scenario while the others are held fixed. Simulations are repeated 100 times for all scenarios. The models are trained on a training set of size $n$ (100 in the base scenario and varied in other experiments), with the hyperparameters $\lambda_1, \lambda_2$ tuned on an independent validation set of the same size as the training set. The final predictive performance is evaluated on a large test set of size 10,000.
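The base-scenario data generation (meta-feature matrix, coefficients with a target signal-to-noise ratio, and AR(1)-correlated features) can be sketched as below; the survival-time generation and the noise calibration to a C-index of 0.8 are omitted, and the function name is ours:

```python
import numpy as np

def simulate_base_scenario(n=100, p=200, q=50, snr=2.0, rho=0.5, seed=1):
    """One draw of the base simulation scenario.

    Z     : p x q Bernoulli(0.1) meta-feature indicators
    alpha : first 20% of coordinates set to 0.2, the rest 0
    beta  : Z @ alpha + noise, noise variance set by SNR = var(Z alpha)/sigma^2
    X     : n x p multivariate normal with AR(1) covariance rho^|i-j|
    """
    rng = np.random.default_rng(seed)
    Z = rng.binomial(1, 0.1, size=(p, q))
    alpha = np.zeros(q)
    alpha[: q // 5] = 0.2
    signal = Z @ alpha
    sigma2 = signal.var() / snr
    beta = signal + rng.normal(0.0, np.sqrt(sigma2), p)
    idx = np.arange(p)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))   # AR(1) covariance
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T
    return X, Z, beta, alpha
```

Multiplying i.i.d. standard normals by the transposed Cholesky factor of $\Sigma$ gives rows of $X$ with exactly the desired AR(1) covariance.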
We ran a series of experiments varying one key parameter at a time from the base-case scenario, as follows. Experiment 1: varying the signal-to-noise ratio of the meta-features from completely uninformative (SNR = 0), to slightly informative (SNR = 0.1), to moderately informative (SNR = 1), to highly informative (SNR = 2). Experiment 2: varying the training sample size $n$. Experiment 3: varying the number of features from low to high: $p$ = 200, 1000, 10000. Experiment 4: varying the number of meta-features $q$, with the number of informative meta-features held fixed.

Simulation results
The results of the experiments are shown in Figure 2. In each panel, the horizontal dashed line represents the population/theoretical C-index of 0.8, i.e., the maximum achievable with infinite training data, provided as a reference for each parameter setting. We compared the performance of the hierarchical ridge-lasso Cox model (ridge penalty on the first-level omic features, Lasso penalty on the second-level meta-features) incorporating meta-features to that of a standard ridge Cox model, and the performance of the hierarchical lasso-lasso Cox model (Lasso penalty on the first-level omic features, Lasso penalty on the second-level meta-features) to that of a standard Lasso Cox model.
With informative meta-features (SNR > 0 in experiments 1-4), the hierarchical ridge-lasso model consistently outperforms the standard ridge model, with the performance gain over the standard ridge model increasing with the informativeness of the meta-features (experiment 1). Importantly, when the meta-features are completely uninformative, the hierarchical ridge-lasso model performs only slightly worse than the standard ridge model (experiment 1, SNR = 0). This shows the robustness of the hierarchical ridge-lasso model to uninformative meta-features.
Experiment 2 shows that the gains in performance of the hierarchical ridge-lasso model over the standard ridge model can be quite large, particularly when the sample size is small. As the sample size $n$ increases, the performance of both models increases and the difference between the two shrinks.
As the dimensionality p of the features increases (experiment 3), the performance of the standard ridge model deteriorates dramatically, while the performance of the hierarchical ridge-lasso model only decreases slowly as the information in the meta-features helps stabilize its performance.
In experiment 4, the performance of the standard ridge model does not change, as it does not utilize meta-feature information. For the hierarchical ridge-lasso model, however, performance decreases as the number of noise meta-features increases (the number of informative meta-features is fixed at 20% of the coordinates of $\alpha$ in the base case, and the additional meta-features are noise).
The comparison between the lasso-lasso model and the standard Lasso model shows trends similar to those between the ridge-lasso model and the standard ridge model, except that both have lower prediction performance than their ridge counterparts. This is not surprising, as the Lasso excels at producing interpretable models, while ridge does well in prediction.
We also examined the ability of the model to select informative meta-features via the second-level Lasso penalty. In particular, we looked at the overall accuracy (percentage of alphas correctly shrunk to 0 or correctly not shrunk to 0), the true positive rate (percentage of non-zero alphas not shrunk to 0), and the false positive rate (percentage of zero alphas not shrunk to 0) in experiment 1, where the informativeness of the second-level meta-features varies (Figure 3). As the informativeness of the meta-features increases, the overall accuracy increases, with the true positive selection rate of meta-features improving dramatically at the cost of a slight increase in the false positive rate.
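The selection metrics reported in Figure 3 can be computed from the estimated and true meta-feature coefficients as follows (a small sketch with our own function name):

```python
import numpy as np

def selection_metrics(alpha_hat, alpha_true):
    """Overall accuracy, true positive rate, and false positive rate of
    meta-feature selection, treating nonzero coefficients as 'selected'."""
    selected = alpha_hat != 0
    positive = alpha_true != 0
    accuracy = (selected == positive).mean()
    tpr = selected[positive].mean() if positive.any() else np.nan
    fpr = selected[~positive].mean() if (~positive).any() else np.nan
    return accuracy, tpr, fpr
```

Averaging these three quantities over the 100 simulation repetitions gives the curves shown as a function of SNR.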

Gene Expression Signatures for Breast Cancer Survival
To illustrate the performance of our approach, we applied the hierarchical survival model to the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) study (23). The METABRIC microarray dataset is available at the European Genome-Phenome Archive under accession EGAS00000000083. It includes cDNA microarray profiling of around 2,000 breast cancer specimens processed on the Illumina HT-12 v3 platform (Illumina_Human_WG-v3). The dataset was divided into a training set of 997 samples and a test set of 995 samples (24). The goal is to build a prognostic model for breast cancer survival based on gene expression and clinical features. The data matrix $X$ consists of 29,477 gene expression probes and two clinical features, age at diagnosis and the number of positive lymph nodes. The meta-feature data $Z$ consists of four "attractor metagenes", gene co-expression signatures that are shared across many cancer types and are associated with specific cancer phenotypes. The shared features in cancer include, for example, the ability of cancer cells to divide uncontrollably, the ability to invade surrounding tissues, and the effort of the organism to fight cancer with a particular immune response (25). Three of the universal attractor metagenes, mitotic chromosomal instability (CIN), mesenchymal transition (MES), and lymphocyte-specific immune recruitment (LYM), were found to be associated with prognosis of breast cancer. In addition, a metagene whose expression is associated with good prognosis contains the expression values of two genes, FGD3 and SUSD3. The CIN, MES, and LYM metagenes each consist of 100 genes, but for our analysis we considered only the 50 top-ranked genes of each. The matrix $Z$ is an indicator matrix of whether a specific expression probe corresponds to a gene in a metagene.
Model building was based on the samples that were ER-positive and HER2-negative, as treatments are homogeneous in this group and it is associated with good prognosis (26). There were 740 samples in the training set and 658 samples in the test set of the ER+/HER2- subset after removing samples with missing values. We used repeated 5-fold cross-validation to tune the hyperparameters $\lambda_1$ and $\lambda_2$ (Table 1). The metagenes CIN and FGD3-SUSD3 were identified by the hierarchical model as important, i.e., they had the largest absolute values of the coefficient estimates $\hat\alpha$.
The metagene CIN, a breast cancer-inducing metagene, had a positive coefficient, indicating that genes in CIN carried an overall increased risk relative to other genes, while the FGD3-SUSD3 metagene had a negative coefficient estimate, indicating that FGD3 and SUSD3 were associated with reduced risk (Table 2). The identified metagenes were also found in a previous analysis (24).

Anti-PD1 Immunotherapy Predictive Biomarker for Melanoma Survival
We also applied the model to a melanoma dataset to predict overall survival after treating patients with a PD-1 immune checkpoint blockade. The programmed death 1 (PD-1) pathway is an immune-regulatory mechanism used by cancer to hide from the immune system. Antagonistic antibodies to the PD-1 pathway and its ligand, programmed death ligand 1 (PD-L1), have demonstrated high clinical benefit rates and tolerability. Immune checkpoint blockades such as nivolumab and pembrolizumab are anti-PD-1 antibodies that have shown improved overall survival in the treatment of advanced melanoma. However, fewer than 40% of patients respond to these treatments (27). Therefore, predicting treatment outcomes and identifying predictive signals are of great interest for appropriately selecting the patients most likely to benefit from anti-PD-1 treatment.
We explored transcriptomes and clinical data using our model to illustrate prediction performance and predictive signal selection.
The dataset combines 3 clinical studies in which RNA sequencing was applied to patients treated with anti-PD-1 antibodies: Gide et al., 2019 (28); Riaz et al., 2017 (29); and Hugo et al., 2016 (30). The gene expression values were normalized against the all-sample average within each study as the control, so that they are comparable across features within a sample and across samples. There are 16,010 genes in common across the 3 studies and 117 subjects combined. We built predictive models of overall survival based on the gene expression profiles. Since the subjects were all treated with anti-PD-1 antibodies, the transcriptomic features selected by the model are predictive signals for treatment efficacy or resistance. We selected meta-features from the Molecular Signatures Database hallmark gene sets. The hallmark gene sets involve biological pathways such as signaling, immune, and proliferation pathways.
They have been applied to analyses of cancer phenotypes in medulloblastoma and glioblastoma and at the protein level (31). Thirteen gene sets were enriched with false positive rates of less than 0.25. An indicator matrix $Z$ was formed to indicate whether each of the 16,010 genes belongs to each of the 13 hallmark gene sets.
We performed 5-fold cross-validation to tune the hyperparameters and report the validation prediction performance. We see an improvement in prediction with the hallmark gene set information, with a C-index of 0.663 for the ridge-lasso model compared to 0.637 for the standard ridge model. At the gene set level, 3 gene sets had an absolute effect size larger than 0.01 (Table 3). Specifically, genes responding to interferon gamma and genes involved in KRAS regulation were identified. A subset of the genes in the gene sets identified by our model were in concordance with previously published anti-PD-1 gene signatures (29,30).

Discussion
In this paper we extended the regularized hierarchical regression model of Kawaguchi et al. to time-to-event data and accommodated a Lasso or elastic net penalty in the second level of the model. The hierarchical regularized regression model enables integration of external meta-feature information directly into the modeling process. We showed that prediction performance improves when the external meta-feature data are informative, and that the improvements are largest for small sample sizes, when prediction is hardest and performance improvement is most needed. Key to obtaining performance gains, though, is prior knowledge of which external information is potentially informative for the outcome. For example, clinicians, epidemiologists, or other substantive experts may provide insights into what types of annotations are likely to be informative. However, the model is robust to incorporating a set of meta-features that is completely irrelevant to the outcome of interest. In this scenario, a very small price in prediction performance is paid relative to a standard ridge model (i.e., one without external information). This should encourage users to integrate meta-features even when uncertain about their informativeness.
An underlying assumption of the proposed regularized hierarchical model is that the effects in a group determined by meta-features (e.g., genes in a pathway) are mostly in the same direction.A limitation of the method is that if the effects have opposite signs and 'cancel each other out' there would be little or no improvement in prediction, even if the pathway information is informative.
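This cancellation can be seen in a toy calculation. With an indicator meta-feature, the second level shrinks the coefficients of a group toward a common target driven by the group's average effect, and that target vanishes when signs are mixed (made-up effect sizes, purely illustrative):

```python
# Hypothetical first-level effects for two pathways (groups of genes).
aligned = [1.0, 0.75, 1.25, 1.0]   # effects share one direction
mixed   = [1.0, -1.0, 0.5, -0.5]   # effects cancel each other out

def group_target(betas):
    # With an indicator meta-feature, the second level shrinks each
    # member of the group toward the group's mean effect.
    return sum(betas) / len(betas)

print(group_target(aligned))  # 1.0: informative shrinkage target
print(group_target(mixed))    # 0.0: pathway membership adds nothing
```

In the second case the hierarchical shrinkage target coincides with that of a plain ridge model, so the pathway annotation yields no gain even though every gene in it is associated with the outcome.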
In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than developing a predictive model. However, there is no standard way to perform formal inference, i.e., to obtain standard errors, p-values, and confidence intervals, with high-dimensional regression models. Several approaches exist (32,33), and this is an active area of research. Adding formal statistical inference would be important future work to expand the range of use of the proposed model.
The regularized hierarchical Cox model is implemented in the xrnet package, available to install via GitHub (34). The average computation times (mean over 100 repetitions) for the settings of simulation experiment 3 with N = 100, q = 50, and p = 200, 1,000, and 10,000, i.e., running a 20×20 hyperparameter grid, are 0.9, 3.6, and 57.4 seconds, respectively. The computations were run on an 8-core M2 CPU under macOS 14.5. The implementation is efficient and can be used to perform analyses with large numbers of features, meta-features, and subjects, as is the case in the METABRIC and anti-PD-1 data applications in Section 3.2. While the models we focused on in the simulations and data applications are mainly 'ridge-lasso', i.e., with an $\ell_2$ norm penalty applied to $\beta - Z\alpha$ and an $\ell_1$ norm penalty applied to the meta-feature coefficients $\alpha$, the implementation offers the flexibility of using the lasso, elastic net, and ridge penalties to penalize the meta-features depending on the application. As a result, different combinations of penalty types can be tuned to achieve optimal prediction performance, and the penalty types can be tailored to the needs of prediction or feature selection. For example, if selection at the meta-feature level is desired and the meta-features are highly correlated, the elastic net penalty is a better option for the regularization of $\alpha$: if there is a group of highly correlated variables, the lasso tends to select one of them essentially at random, while the elastic net enjoys a grouping effect that selects all the variables in a group with estimated coefficients close in magnitude (2). The approach does not perform feature selection at the first level, as it uses a ridge penalty. In a high-dimensional setting, standard regularized regressions like the lasso and elastic net often select a relatively large number of features. It can then be valuable to identify groups of genes defined by meta-features that may jointly have significant predictive power for the outcome of interest. Another potential improvement of the model is to extend the range of penalty types to non-convex penalties, such as SCAD (35) and MCP (36). These penalties yield less biased effect size estimates than those of the lasso and elastic net.
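The grouping-effect contrast between the lasso and the elastic net can be demonstrated with a toy coordinate-descent example (a self-contained sketch, not the xrnet implementation): with two perfectly correlated features, the lasso concentrates all weight on one of them, while the elastic net splits it evenly across the group.

```python
def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def coord_descent(x_cols, y, lam1, lam2, n_iter=200):
    """Naive cyclic coordinate descent for
    (1/2n)||y - Xb||^2 + lam1*||b||_1 + (lam2/2)*||b||_2^2,
    with columns scaled so that (1/n) x_j . x_j = 1."""
    n, p = len(y), len(x_cols)
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j.
            r = [y[i] - sum(x_cols[k][i] * beta[k]
                            for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(x_cols[j][i] * r[i] for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam1) / (1.0 + lam2)
    return beta

# Two identical (perfectly correlated) features, y = 2 * x.
x = [1.0, -1.0, 1.0, -1.0]   # (1/n) x.x = 1
y = [2.0, -2.0, 2.0, -2.0]
lasso = coord_descent([x, x], y, lam1=0.5, lam2=0.0)
enet  = coord_descent([x, x], y, lam1=0.5, lam2=1.0)
print(lasso)  # [1.5, 0.0]: lasso keeps only one of the duplicates
print(enet)   # ~[0.5, 0.5]: elastic net splits weight across the group
```

Which duplicate the lasso keeps depends only on the update order, which is why lasso selection among correlated features is unstable, whereas the elastic net solution is symmetric in the group.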

Conclusions
The proposed hierarchical regularized regression model enables integration of external meta-feature information directly into the modeling process for time-to-event outcomes. Its prediction performance improves when the external meta-feature data is informative. Importantly, when the external meta-features are uninformative, the prediction performance of the regularized hierarchical model is on par with that of standard regularized Cox regression, which should encourage the user to integrate meta-features even if uncertain about their informativeness. In addition to developing predictive signatures, the model can also be deployed in discovery applications where the main goal is to identify important features associated with the outcome rather than developing a predictive model. The accompanying R package xrnet, written with C++, is computationally efficient, accommodates large and sparse matrices, and offers the flexibility of applying lasso, elastic net, and ridge penalties to both omic features and meta-features.

Figure 3: Simulation results: meta-feature selection.

For the coordinate-wise updates, $r^{(j)} = y' - \sum_{k \neq j} x_k \hat{\beta}_k$ is the partial residual excluding the contribution of $x_j \beta_j$, and $S(z, \lambda)$ is the soft-thresholding operator with value

$S(z, \lambda) = \mathrm{sign}(z)(|z| - \lambda)_+ = \begin{cases} z - \lambda, & z > \lambda, \\ 0, & |z| \le \lambda, \\ z + \lambda, & z < -\lambda. \end{cases}$

More than one value combination of $\lambda_1, \lambda_2$ is usually of interest, as $\lambda_1, \lambda_2$ are tuned by cross-validation to get the best performance out of the model. For the proposed model, a two-dimensional grid of $(\lambda_1, \lambda_2)$ values is constructed, and pathwise coordinate optimization is applied along the grid, yielding a path of solutions corresponding to each combination of $\lambda_1, \lambda_2$. Model parameter solutions are initialized at $\hat{\beta} = 0$, $\hat{\alpha} = 0$, and $\lambda_1$ controls the amount of shrinkage applied to the $\ell_2$ term $\|\beta - Z\alpha\|_2^2$.

The hyperparameters $\lambda_1, \lambda_2$ were tuned in the training set, with 50 repetitions. The test set was used to evaluate model performance. The same training/test scheme was used to fit a standard ridge regression without attractor metagene information as a comparison. With only gene expression features in the model and no clinical features, the mean test C-index for the ridge-lasso hierarchical model with metagene information was 0.658 (95% CI: 0.639, 0.677), which compares favorably with the mean test C-index of 0.639 (95% CI: 0.628, 0.650) for the standard Cox ridge counterpart. When adding the clinical features (age at diagnosis and number of positive lymph nodes), the mean test C-index increased to 0.734 (95% CI: 0.716, 0.752) and 0.728 (95% CI: 0.726, 0.730) for the Cox hierarchical model and the standard Cox ridge model, respectively (Table 1).
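The soft-thresholding operator translates directly into code; a minimal Python illustration:

```python
def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

print(soft_threshold(2.5, 0.5))   # 2.0
print(soft_threshold(-0.4, 0.5))  # 0.0  (shrunk exactly to zero)
print(soft_threshold(-1.0, 0.5))  # -0.5
```

Inputs with absolute value below the threshold are set exactly to zero, which is what produces sparse meta-feature coefficient estimates in the ridge-lasso model.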

Table 1. Test C-index between standard ridge and ridge-lasso

Table 3: Coefficient estimates (non-zero, selected) for gene sets

A.1 Computation of diagonal elements of weight matrix
Diagonal elements of the weight matrix $W$, the Hessian of the log Cox partial likelihood, $w_{kk}$, have the form:

$w(\eta)_{kk} = \sum_{i \in C_k} \left[ \frac{e^{\eta_k}}{\sum_{j \in R_i} e^{\eta_j}} - \frac{e^{2\eta_k}}{\left(\sum_{j \in R_i} e^{\eta_j}\right)^2} \right],$

where $R_i$ is the risk set at the $i$-th event time $t_i$ and $C_k = \{i : t_i \le y_k\}$ is the set of event times at which observation $k$ is still at risk. The two sums over the sets $C_k$ and $R_i$ both have $O(n)$ elements, hence this is an $O(n^2)$ computation. However, the difference between consecutive risk sets $R_i$ and $R_{i+1}$ consists of the observations in $R_i$ but not in $R_{i+1}$, i.e., $\{j : t_i \le y_j < t_{i+1}\}$. Provided the observed times $y$ are sorted, each risk-set sum $\sum_{j \in R_i} e^{\eta_j}$ can therefore be obtained from the previous one, so that all of them are computed in a single cumulative pass. The same cumulative sum idea applies to the sets $C_k$: the sums are calculated once and updated at each sample index, which brings the computation cost down to linear time, $O(n)$. Counting the sorting of the observed times as a data pre-processing step, the overall computation time for the weights is $O(n \log n)$.

For the coordinate-wise updates, the gradient of the penalized weighted least squares objective with respect to $\beta_j$ is set to zero to obtain the update for $\beta_j$; the analogous gradient with respect to the meta-feature coefficient $\alpha_j$ yields its update. For the $\ell_1$-penalized $\alpha_j$, the update involves the soft-thresholding operator, and the cases $\hat{\alpha}_j > 0$, $\hat{\alpha}_j < 0$, and $\hat{\alpha}_j = 0$ are treated separately.
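As a concrete check of the cumulative-sum idea, here is a minimal Python sketch (assuming observations pre-sorted by observed time and no tied event times; an illustration, not the xrnet C++ implementation) comparing the naive quadratic computation of the Hessian diagonal with the linear-time one:

```python
import math

def hessian_diag_naive(eta, delta):
    """O(n^2): w_kk for the Cox log partial likelihood (no ties),
    observations sorted by increasing observed time."""
    n = len(eta)
    w = [0.0] * n
    for i in range(n):
        for k in range(n):
            if delta[k] == 1 and k <= i:   # event time t_k <= y_i
                risk = sum(math.exp(eta[j]) for j in range(k, n))
                w[i] += (math.exp(eta[i]) / risk
                         - math.exp(2 * eta[i]) / risk ** 2)
    return w

def hessian_diag_linear(eta, delta):
    """O(n) after sorting: a suffix sum gives the risk-set totals,
    prefix sums accumulate the per-event-time terms."""
    n = len(eta)
    exp_eta = [math.exp(e) for e in eta]
    # Suffix cumulative sum: risk[k] = sum_{j >= k} exp(eta_j).
    risk = [0.0] * (n + 1)
    for k in range(n - 1, -1, -1):
        risk[k] = risk[k + 1] + exp_eta[k]
    w, a, b = [0.0] * n, 0.0, 0.0
    for i in range(n):
        if delta[i] == 1:                  # event observed at y_i
            a += 1.0 / risk[i]
            b += 1.0 / risk[i] ** 2
        w[i] = exp_eta[i] * a - exp_eta[i] ** 2 * b
    return w

eta = [0.3, -0.2, 0.5, 0.0, -0.4]   # linear predictors, toy values
delta = [1, 0, 1, 1, 0]             # event indicators
print(hessian_diag_naive(eta, delta))
print(hessian_diag_linear(eta, delta))  # same values, linear time
```

The two functions agree to machine precision; only the linear-time version scales to the sample sizes used in the data applications.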