Embracing Green Computing in Molecular Phylogenetics

Abstract
Molecular evolutionary analyses require computationally intensive steps such as aligning multiple sequences, optimizing substitution models, inferring evolutionary trees, testing phylogenies by bootstrap analysis, and estimating divergence times. With the rise of large genomic data sets, phylogenomics is imposing a large carbon footprint on the environment, with consequences for the planet’s health. Electronic waste and energy usage are major environmental issues. Fortunately, innovative methods and heuristics are available to shrink the carbon footprint, presenting researchers with opportunities to lower environmental costs and make evolutionary computing greener. Green computing will also enable greater scientific rigor and encourage broader participation in big data analytics.

Many biological disciplines apply computational approaches to investigate evolutionary questions involving the origins of genes, evolutionary relationships of organisms, positive and negative selection, the evolution of biodiversity, and genotype-phenotype connections across the tree of life. The importance of these questions is reflected by the escalating use of software for molecular evolutionary analyses (fig. 1). Paradoxically, the means by which we explore the tree of life negatively impact that same evolving tree, because computing has environmental costs. A computer's energy usage translates into carbon dioxide emissions. Many scientists are seriously assessing the environmental cost of data analysis and the carbon footprint left by molecular evolutionary studies (Tao et al. 2019; Kumar and Sharma 2021; Alvarez-Carretero et al. 2022; Grealey et al. 2022). In particular, Grealey et al. (2022) have recently assessed the energy utilization and the associated carbon footprint of bioinformatics, including phylogenetic analysis and genome assembly.
Strategies are being developed to achieve energy savings in a quest for greener computing in the sciences and a healthier global ecology with health benefits to the general public (Jones 2018; Portegies Zwart 2020; Stevens et al. 2020; Strubell et al. 2020; Bender et al. 2021; Lannelongue, Grealey, Bateman, et al. 2021; Grealey et al. 2022). For example, cloud computing avoids idle time, as partial CPU and memory use in standalone computers wastes energy (Shehabi et al. 2016; Jones 2018). However, speeding up research computing through faster processors and parallelization demands extra energy and, thus, emits more greenhouse gases. Using idle GPUs to assist CPUs can also result in greener computing, but this approach depends on appropriate software implementations (Grealey et al. 2022). Interestingly, energy production has a much smaller carbon footprint in some countries (e.g., Norway and Switzerland), making them better locations for cloud computing.
Substantial reductions in energy costs can also be achieved by complementary means, which are the focus of this perspective. Here, I highlight conceptual and technical advances that can reduce the computational time and memory demands of phylogenomics. I suggest that researchers choose methods, algorithms, and software practices that demand fewer compute cycles and less computer memory. These choices will diminish the carbon footprint of computational molecular evolution and align it with ecologically sound bioinformatic practices. These and future developments of resource-thrifty and accurate methods will amplify the impact of general strategies for greener computing.
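To make concrete why fewer compute cycles and less memory translate into less energy, the following is a minimal sketch of a simple energy model in the spirit of calculators such as Green Algorithms. The function name, the parameterization, and every numeric value in the example are illustrative assumptions rather than figures from this article: energy is taken to scale with run time multiplied by the power drawn by the cores and the memory, inflated by a data-center overhead factor (PUE).

```python
def estimate_energy_kwh(runtime_h, n_cores, core_power_w, core_usage,
                        memory_gb, memory_power_w_per_gb, pue):
    """Estimate energy (kWh) for one analysis under a simple linear power model.

    All arguments are user-supplied assumptions:
      runtime_h              wall-clock run time in hours
      n_cores                number of CPU cores reserved
      core_power_w           power draw per core in watts
      core_usage             average core utilization, between 0 and 1
      memory_gb              memory reserved in gigabytes
      memory_power_w_per_gb  power draw per gigabyte of memory in watts
      pue                    power usage effectiveness (overhead factor >= 1)
    """
    if not 0.0 <= core_usage <= 1.0:
        raise ValueError("core_usage must be between 0 and 1")
    if pue < 1.0:
        raise ValueError("pue must be at least 1")
    power_w = n_cores * core_power_w * core_usage + memory_gb * memory_power_w_per_gb
    return runtime_h * power_w * pue / 1000.0  # Wh -> kWh


if __name__ == "__main__":
    # Illustrative placeholder values only.
    slow = estimate_energy_kwh(10, 4, 10.0, 1.0, 16, 0.4, 1.5)
    fast = estimate_energy_kwh(5, 4, 10.0, 1.0, 8, 0.4, 1.5)
    print(f"slow method: {slow:.3f} kWh, fast method: {fast:.3f} kWh")
```

Under this model, halving the run time halves the energy, and reducing reserved memory lowers it further, which is the lever that method and algorithm choice provides.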

Carbon Footprints of Phylogenetic and Phylogenomic Analyses
A standard protocol in molecular phylogeny is first to assemble a set of sequences and subject them to alignment procedures to establish base-by-base homology across sequences from different species and genes (Kumar and Filipski 2007). The resulting multiple sequence alignments (MSAs) become ready for molecular phylogenetics after proper postprocessing, including manual curation (Yang and Rannala 2012; Kapli et al. 2020).

Selecting the Optimal Model
In analyzing an MSA, the usual first step is to estimate the substitution model that best describes the overall pattern of base changes. This analysis requires evaluating several models of nucleotide (or amino acid) substitution as well as models of rate variation across sites. Maximum likelihood (ML) tests of several nested and non-nested models under the Bayesian

© The Author(s) 2022. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

For example, an MSA of 1.3 million base pairs from 37 mammalian species took 106 CPU hours and 9.3 gigabytes (GB) of peak memory in ModelFinder to select the optimal model (Kalyaanamoorthy et al. 2017). According to the Green Algorithms (GA) resource, this analysis would require 1.6 kilowatt-hours (kWh) of energy and have a carbon footprint of 0.62 kg CO2e. GA suggests that a tree would take 20 days to scrub the environment of the greenhouse gases emitted (table 1a1)! We can save more than 90% of the energy and, thus, emit less than 10% of the greenhouse gas by using ModelTest-NG (Darriba et al. 2020) and jModelTest (Posada 2008), which produce similar results (table 1a). Recent machine-learning approaches also promise to provide green alternatives (Abadi et al. 2020; Burgstaller-Muehlbacher et al. 2021). Also, a machine-learning method for detecting autocorrelated evolutionary rates in a phylogeny (CorrTest; Tao et al. 2019) requires a small fraction of the energy used by a comparable Bayes factor analysis (table 1b).
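The arithmetic linking the quoted energy, carbon footprint, and tree-days can be sketched as follows. The energy (1.6 kWh) and footprint (0.62 kg CO2e) are the figures given above for the ModelFinder run; the sequestration rate of about 11 kg of CO2 per mature tree per year is an assumption adopted here because it reproduces the roughly 20 tree-days quoted, and the function names are hypothetical.

```python
# Figures quoted in the text for the ModelFinder analysis.
ENERGY_KWH = 1.6
FOOTPRINT_KG_CO2E = 0.62

# Assumption: a mature tree absorbs about 11 kg of CO2 per year.
TREE_G_PER_YEAR = 11_000.0


def implied_carbon_intensity(footprint_kg, energy_kwh):
    """Carbon intensity (kg CO2e per kWh) implied by a footprint and an energy use."""
    return footprint_kg / energy_kwh


def tree_days(footprint_kg, tree_g_per_year=TREE_G_PER_YEAR):
    """Days a single mature tree would need to absorb the given footprint."""
    grams_per_day = tree_g_per_year / 365.0
    return footprint_kg * 1000.0 / grams_per_day


def footprint_after_saving(footprint_kg, fraction_saved):
    """Footprint remaining after saving a given fraction of the energy."""
    if not 0.0 <= fraction_saved <= 1.0:
        raise ValueError("fraction_saved must be between 0 and 1")
    return footprint_kg * (1.0 - fraction_saved)


if __name__ == "__main__":
    ci = implied_carbon_intensity(FOOTPRINT_KG_CO2E, ENERGY_KWH)
    print(f"implied carbon intensity: {ci:.3f} kg CO2e/kWh")
    print(f"tree-days for the full analysis: {tree_days(FOOTPRINT_KG_CO2E):.1f}")
    greener = footprint_after_saving(FOOTPRINT_KG_CO2E, 0.90)
    print(f"tree-days after a 90% saving: {tree_days(greener):.1f}")
```

Because the footprint is proportional to the energy used, a 90% energy saving cuts the tree-days by the same factor, provided the electricity source is unchanged.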

Building a Molecular Phylogeny
Using an MSA and the best-fit substitution model, we can infer a phylogeny representing the evolutionary histories of genes and species. ML and minimum evolution (ME) are two widely used model-based optimality principles for reconstructing phylogenetic trees (Nei and Kumar 2000). The neighbor-joining method (Saitou and Nei 1987), based on the ME principle and used in thousands of studies, has a negligible carbon footprint (table 1c3) compared with popular heuristic searches under the ML optimality criterion (table 1c1). Another approach that combines optimality criteria (FastTree) has an intermediate environmental impact (table 1c2). The accuracy of phylogenies produced by different techniques is comparable for many applications (Rosenberg and Kumar 2001; Price et al. 2010; Yoshida and Nei 2016), so researchers have many excellent options for reducing the environmental impact of their analyses.

Confidence Limits on Inferred Phylogenetic Groupings
Statistical evaluation of the robustness of inferred phylogenetic relationships is essential in evolutionary biology. Felsenstein's (1985) bootstrap resampling has been the preferred approach, but it is computationally intensive, requiring the inference of hundreds of phylogenetic trees for pseudo-MSAs generated by sampling sites with replacement from the full data set. This analysis has a rather large carbon footprint (table 1d1), as does its Bayesian alternative that produces posterior probabilities for inferred evolutionary relationships (table 1d5). Many approximate energy-efficient methods are now available for phylogenomic data sets, including Little Bootstraps (Sharma and Kumar 2021) for long sequences, and ultrafast bootstrapping (Minh et al. 2013) and rapid bootstrapping (Stamatakis et al. 2008) for data sets containing large numbers of sequences. These approximate methods have much smaller carbon footprints than standard approaches (table 1d). Combining different techniques can save more than 99% in time, memory, and energy in testing the robustness of inferred phylogenies (table 1d4).
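The site-resampling step of the standard bootstrap described above can be sketched as follows. This is a minimal illustration, not the implementation used by any of the cited programs; the function names and the toy alignment are hypothetical. The expensive part, inferring a tree from each pseudo-MSA, is deliberately left out, and it is that step, repeated hundreds of times, that drives the carbon footprint.

```python
import random


def bootstrap_alignment(msa, rng):
    """Return one pseudo-MSA built by sampling alignment columns with replacement.

    msa maps sequence names to aligned sequences of equal length.
    """
    names = list(msa)
    n_sites = len(msa[names[0]])
    if any(len(seq) != n_sites for seq in msa.values()):
        raise ValueError("all sequences must have the same aligned length")
    # Draw n_sites column indices with replacement; the same columns are
    # used for every sequence so that site homology is preserved.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(msa[name][c] for c in cols) for name in names}


def bootstrap_replicates(msa, n_replicates, seed=0):
    """Generate n_replicates pseudo-MSAs reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [bootstrap_alignment(msa, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    toy_msa = {"sp1": "ACGTAC", "sp2": "ACGAAC", "sp3": "TCGAAT"}
    for replicate in bootstrap_replicates(toy_msa, 3, seed=42):
        print(replicate)
```

Each replicate would then be passed to a tree-inference program, and the support for a grouping is the fraction of replicate trees that contain it.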

From Phylogenies to Timetrees
Another common phylogenetic analysis is the estimation of divergence times corresponding to speciations, gene duplications, and the evolution of new strains. Relaxed clock methods have revolutionized this practice (Kumar and Hedges 2016; Tao et al. 2020). Bayesian and RelTime methods produce estimates of similar quality (e.g., Barba-Montoya et al. 2020; Mello et al. 2021), but their energy requirements are dramatically different (table 1e). There is also a large difference in the carbon footprints imposed by slow and fast Bayesian implementations (table 1e). Consequently, researchers have a large spectrum of more environmentally friendly alternatives for molecular dating methods.

Green Software Implementations
Ultimately, efficient software implementation is the key to realizing the potential of all conceptual, methodological, and algorithmic innovations. Software design and resource utilization dictate energy consumption, so implementations that use less computer memory and time have a lower carbon footprint. Availability of software versions that can run on the cloud will also reduce carbon footprints. Another emerging area of improvement lies in creating stopping rules that can detect when further computing will not change the outcome significantly. For example, adaptive rules are being developed to automatically determine the number of bootstrap replicates needed for reliable confidence limits (Stamatakis 2014; Sharma and Kumar 2021). In the future, smarter software will avoid overcomputing, decreasing the carbon footprints of big data analyses.
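One way such a stopping rule could work is sketched below. This is a hypothetical illustration and not the rule implemented in the cited software: replicates are added in batches until the binomial standard error of the estimated support for a grouping falls below a tolerance. The callable `supports_clade` is a stand-in for the expensive step of inferring a tree from one bootstrap replicate and checking whether it contains the grouping; the tolerance, batch size, and cap are arbitrary assumptions.

```python
import math
import random


def adaptive_bootstrap_support(supports_clade, tol=0.02, batch=50,
                               max_replicates=2000):
    """Add bootstrap replicates in batches until the support estimate is stable.

    supports_clade: zero-argument callable returning True when one new
        replicate tree contains the grouping of interest (hypothetical stand-in).
    Returns (estimated support, number of replicates used).
    """
    if batch < 1 or max_replicates < 1:
        raise ValueError("batch and max_replicates must be positive")
    hits = 0
    n = 0
    while n < max_replicates:
        for _ in range(min(batch, max_replicates - n)):
            hits += bool(supports_clade())
            n += 1
        p = hits / n
        # Binomial standard error of the support proportion.
        se = math.sqrt(p * (1.0 - p) / n)
        if se <= tol:
            break
    return p, n


if __name__ == "__main__":
    rng = random.Random(7)
    # Simulated grouping recovered in about 95% of replicates.
    support, used = adaptive_bootstrap_support(lambda: rng.random() < 0.95)
    print(f"support ~ {support:.2f} after {used} replicates")
```

Strongly supported or strongly rejected groupings reach the tolerance after few replicates, whereas groupings with support near 50% need many more, so the computing effort adapts to the data instead of being fixed in advance.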

Benefits beyond Environmental Sustainability
Computationally efficient analyses will also enhance the rigor of scientific research, reducing the resources required to assess the robustness of inferences to subsetting of data, choice of substitution models and strategies, and combining multigene data sets. Computationally efficient and economical computing will encourage researchers to evaluate the reproducibility of published results. The currently high computational demands of reproducibility studies put efforts to reproduce research results out of the reach of researchers lacking access to high-performance computing infrastructure.
Greener computing is also a key to addressing equity, diversity, and sustainability in scientific research and education. Green computing requires fewer compute cycles and less computer memory. It reduces the expense of computational hardware and the cost of on-demand calculations. Economical computing makes computational research accessible to a broader community, as the research funding for scientific investigations is limited. Greener computing, therefore, will uniquely address economic disparities among researchers due to their local constraints. Greener alternatives for molecular phylogenetic analysis will increase participation by researchers worldwide in molecular evolutionary research and the genomic revolution in biology.

Concluding Remarks
In the Anthropocene, where massive planetary changes are taking place because of human activity, computing is often thought of as a "clean" practice, when in fact, it can be quite the opposite. All branches of biology need to re-evaluate their practices in keeping with the underlying goal of studying life in the first place. For computational analyses, with the routine assembly of big data sets, analytical practices of the past hamper research by the need for excessive computing time and memory. These obstacles hinder both rigorous scientific investigations and wider participation in molecular phylogenetics. Large carbon footprints of many currently popular approaches have negative impacts on the environment, human health, and the sustainability of scientific computing. Fortunately, many accurate and resource-thrifty methods and algorithms are available for molecular phylogenetics. Applying these methods synergistically with computer hardware optimizations will help us achieve greater scientific rigor and broader participation while minimizing financial and environmental costs. I see a bright future for green computing in which conceptual and technical advances will further diminish the carbon footprints of increasingly complex phylogenomic analyses.

Supplementary Material
Supplementary information is available at Molecular Biology and Evolution online.

NOTE.-The C-footprint (carbon footprint) is the amount (g) of CO2 released in the production of the energy (kilowatt-hours, kWh) needed to power computers in the USA, estimated using the Green Algorithms website. Tree days are calculated based on the information that a mature tree can scrub 917 g of CO2e per month (Grealey et al. 2022). The Supplementary Material online provides details on software used and the options applied.