Somatic evolution and global expansion of an ancient transmissible cancer lineage

The canine transmissible venereal tumour (CTVT) is a cancer lineage that arose several millennia ago and survives by ‘metastasising’ between hosts via cell transfer. The somatic mutations in this cancer record its phylogeography and evolutionary history. We constructed a time-resolved phylogeny from 546 CTVT exomes and describe the lineage’s worldwide expansion. Examining variation in mutational exposure, we identify a highly context-specific mutational process that operated early but subsequently vanished, correlate ultraviolet-light mutagenesis with tumour latitude, and describe tumours with heritable hyperactivity of an endogenous mutational process. CTVT displays little evidence of ongoing positive selection, and negative selection is detectable only in essential genes. We illustrate how long-lived clonal organisms capture changing mutagenic environments, and reveal that neutral drift is the dominant feature of long-term cancer evolution.


4
The identification of somatic variants in CTVT is challenging due to the lack of availability of normal DNA from the 'founder' dog which initially spawned the lineage. In addition, CTVT tumours contain variable numbers of normal cells derived from the matched host. Thus, genetic variation detected within CTVT relative to the dog reference genome may represent somatic variation, germline variation from the CTVT founder dog, or germline variation from the matched host dog. In order to obtain a variant set enriched for somatic mutations, only variants that were present in at least one CTVT tumour, but which were not identified in any of the 495 matched host dogs that were sequenced as part of this study, were retained. Variants that were exclusively called in one or more of the unmatched tumours (Dataset S1) were not included in the candidate somatic set. After applying these filters, 148,030 SNVs and 12,177 indels were retained as candidate somatic variants (Table S1); this set is referred to as the 'CTVT candidate somatic variant set'.
It is important to note that this filtering approach will have also removed any somatic variant which occurred at the same position and involved the same base change as a germline variant within the host panel. The number of these was estimated by simulating 1,000 sets of 148,030 SNVs with the same mutational spectrum as that observed in the candidate somatic variants (see 'Mutational signature analysis' below). This analysis revealed that 1,257-1,485 (median 1,363) true somatic variants in CTVT may have been excluded for this reason.
The features of the CTVT candidate somatic variant set were assessed in order to determine the likely proportion of germline variants as follows: dN/dS. Analysis of the CTVT candidate somatic variant set indicated dN/dS ≈ 1 for missense and nonsense mutations (Fig. 3D). Given that germline variants are under strong negative selection (dN/dS = 0.439 and dN/dS = 0.158 for the sets of missense and nonsense germline variants identified in this study, respectively), these findings suggest that the candidate somatic set is indeed largely somatic. By including increasing numbers of germline single nucleotide polymorphisms (SNPs) in simulated neutral mutation sets (5% and 10% additional variants from a baseline of 148,030 simulated SNVs), a corresponding reduction in dN/dS was observed (Fig. S11, A-C; see 'Selection analysis' below). It should be noted that the removal of true somatic variants occurring at the same position as germline variants will artificially increase dN/dS. However, filtering those variants coinciding with germline variants within our host panel from the simulated mutation sets did not result in a significant deviation from dN/dS = 1 (Fig. S11D). Consistent with these observations, the CTVT candidate somatic set has a very distinct annotation profile to the set of germline variants identified in our host dog population (Fig. S12).
Private germline variants in the dog population. The number of private germline variants was measured for each dog within our panel of 495 host dogs. Between 0 and 895 (median 110) private germline variants were found per exome (Fig. S13), which provides a range estimate of the expected number of rare germline variants (from the CTVT founder dog) that may be present in the CTVT candidate somatic set. As the CTVT founder dog belonged to an under-sampled dog population (10), however, it should be noted that the number of private variants in the CTVT founder dog's germline may have been under-estimated by this approach.
Variant phylogenetic distribution. The majority of variants within the candidate somatic set were shared between subsets of tumours with phylogenetically informative patterns (Fig. S14). This pattern would not be expected for remaining CTVT founder germline variants (which would be found in all tumours) or for remaining matched host variants (which would be likely to be present in just one tumour, or phylogenetically discordant sets of tumours).
Mutational signature fitting. Germline variants are largely generated by two endogenous mutational processes, corresponding to mutational signatures 1 and 5 in COSMIC (48). However, germline variants, called with respect to the reference genome, show a distinctive mutational spectrum which reflects the non-directionality of these variants (i.e. the reference can be either ancestral or derived with respect to the variant set). The mutational spectrum of the germline variants called in a single host dog sample was used to define a 'germline mutational signature'. This was fitted to the CTVT mutational spectrum together with COSMIC signatures 1, 5, 7, as well as extracted signatures 2* and A (see 'Mutational signature analysis' below). This analysis suggested that germline contributed only 0.5% of mutations to the CTVT spectrum. However, it should be noted that rare germline variants are unlikely to be captured using this approach.
Variant allele fraction (VAF) distributions. Variants due to contamination from matched hosts have a distinct VAF profile compared with somatic variants. Furthermore, variants due to contamination from matched hosts would be most likely unique to a single tumour. For each tumour, the VAF profiles of variants unique to that tumour were assessed; this analysis did not suggest significant matched host contamination of this variant set.
Overall, these analyses suggest that the CTVT candidate somatic set is very highly enriched for true somatic variants. However, some minimal level of germline contamination may still remain.
Information on the distribution of variants is provided in Table S1. The code used for variant classification is available in GitHub (https://github.com/baezortega/TCG2019).

Variant annotation
Annotation of the complete variant set was performed using the Ensembl Variant Effect Predictor (40) v83. The Ensembl CanFam3.1 v83 gene annotation was used as a reference, and predicted variant consequences, CDS and protein positions, SIFT scores and other information were obtained for each variant. Variant annotation data were then post-processed to facilitate their use. The code used for this post-processing is available in GitHub (https://github.com/baezortega/TCG2019).
Copy number assignment Variant copy number was estimated using available data on the copy number alterations of two previously studied CTVT whole genomes (12). Because these copy number coordinates relate to a previous version of the dog reference genome (CanFam2), variant coordinates were remapped to CanFam2 using the NCBI Genome Remapping Service (http://www.ncbi.nlm.nih.gov/genome/tools/remap).
CanFam2-relative variant 6 coordinates were then matched against the available copy number data, obtaining copy number estimates for each variant. Integer gene copy number was estimated by calculating the rounded average of the copy number estimates of all the variants in each gene. A gene was then assigned a consensus integer copy number only if its two available integer estimates (one for each of the two tumours mentioned above) had consistent values. Genes with inconsistent estimates were not assigned a consensus copy number, and were not considered when defining gene groups based on copy number (see 'Selection analysis' below).

Tumour purity estimation
The purity (cancer cell fraction) of each tumour sample was calculated as twice the mode of the VAF distribution of somatic variants in the sample: in a perfectly pure tumour, the mode of the VAF distribution of somatic variants would be 0.5 (as most variants are heterozygous, and the CTVT ploidy is centred on 2 (12)). The purity distribution of the 546 tumour samples sequenced had a median of 0.59 and an interquartile range of 0.18.
Somatic mutation prevalence analysis Somatic mutation prevalence in CTVT exomes (Fig. 3A) was calculated based on the somatic mutations identified in the genomic regions used for exome pull-down (see 'Exome pull-down and sequencing' above). Somatic mutation prevalence in human cancer exomes was calculated based on the mutational catalogues published by Alexandrov et al. (5) (available at ftp://ftp.sanger.ac.uk/pub/cancer/AlexandrovEtAl), assuming that an average human exome has 30 Mb in protein-coding genes with sufficient coverage. The code used in this analysis is available in GitHub (https://github.com/baezortega/TCG2019).
Phylogenetic inference: Maximum-likelihood topology inference An initial phylogenetic topology was obtained using RAxML (41) v8.2.9. A sequence alignment template (32 Mb) was constructed by concatenating the coding sequences (CDS) of all the coding genes in the CanFam3.1 reference genome. The alignment was composed of 540 sequences: the reference sequence and 539 tumour sequences. The remaining 7 tumours (148T, 759T, 808T, 860T, 1426T1, 1478T, 1479T) did not have sufficient quality for them to be further considered.
For each of the tumours, its somatic variants were introduced on a copy of the concatenated CDS reference sequence. Before introduction of each variant, the power to successfully call a true variant at each individual variant locus in each tumour was calculated. If power for a given variant in a tumour is insufficient, the IUPAC symbol representing ambiguity between the reference and mutated alleles was imposed. Only if power was considered sufficient the variant was introduced as either the reference or the mutated allele. Sufficient power was defined as power to identify 6 supporting reads. For each variant in each tumour, the expected number of supporting reads in a heterozygous situation was estimated using tumour purity and read depth at the variant locus, as (read depth × tumour purity / 2). If the expected number of reads was <6, then the corresponding IUPAC ambiguity character was introduced. Otherwise, the mutated allele was introduced if there were ≥3 reads supporting the variant; otherwise, the reference allele was kept. Alleles that were called in the transcribed strand of the gene were complemented before variant introduction.
Phylogenetic tree inference was performed on the resulting 540-sequence alignment using RAxML (41) v8.2.9, and a maximum likelihood (ML) tree was inferred using a generalised time reversible (GTR) substitution model with a Gamma model of site heterogeneity. Any columns in the alignment containing only undetermined ('N') sites were removed by RAxML. The topology of the resulting ML tree was contrasted with results from previous mitochondrial genome studies of CTVT (11) and found to be highly concordant.
Phylogenetic inference: Time-of-origin estimation An interval estimate for the time of origin of CTVT (i.e. the age of the CTVT founder tumour) was obtained using a Poisson Bayesian model which incorporates information about clonal somatic variation from a case of direct CTVT transmission from mother to pup during birth (10). Although transmission of CTVT during birth is not frequently reported, any inferences about the clonal somatic mutation rate should apply to CTVT tumours in general. The known data in this model consisted of (i) the ages of mother and pup, (ii) the number of clonal somatic C>T mutations at CpG dinucleotides ([C>T]pG) that were shared between the respective tumours from mother and pup, and (iii) the numbers of clonal somatic [C>T]pG mutations that were private to each tumour (10). Only [C>T]pG variants were used, as these have been proposed to be caused by a constant, clock-like mutational process (9). Importantly, this model assumes (i) that the [C>T]pG mutation rate in CTVT has remained constant over the lineage history, (ii) that CTVT transmission took place via monoclonal seeding (either by a single cell, or by identical cells from a single clone), (iii) that the mother was infected with CTVT around the time when the pup was conceived (as suggested by somatic variation data (10)), and (iv) that the common ancestor of the tumour cells sampled in mother and pup arose in the mother.
The parameters and data in the model were denoted as follows. µ is the mutation rate, t0 is the time of origin of CTVT, tMRCA(M,P) is the time of the MRCA of the tumour cells in both mother and pup, tMRCA(M) is the time of the MRCA of the tumour cells in the mother, tMRCA(P) is the time of the MRCA of the tumour cells in the pup, t1 is the time elapsed between t0 and tMRCA(M,P), tM is the time elapsed between tMRCA(M,P) and tMRCA(M), tP is the time elapsed between tMRCA(M,P) and tMRCA(P), m1 is the number of clonal somatic [C>T] From the assumptions above, tMRCA(M,P) was bounded on one side by the time of birth of the pup (10 months ago), and on the other side by the time when the mother likely reached sexual maturity (2.5 years ago, based on the mother's age of 3 years) (10). The time of birth was also a hard upper bound on both tMRCA(P) and tMRCA(M); this ensured that tMRCA(M,P) > tMRCA(M). The prior on the mutation rate was exponential with mean of 40 mutations per genome per year [Exponential(1/40)]; this was a rough expectation of the [C>T]pG mutation rate based on a published estimate of 16.61 mutations per Gb per year for the mutation rate in human cervical cancer (9). The prior on the time of origin was uniform between 200 years ago (the time of the oldest known written record of CTVT (3)) and 40,000 years ago (the upper bound of wolf-dog divergence time estimates (49,50)). The prior on tMRCA(M,P) was Exponential (18), shifted to the lower bound of the parameter (10 months) and truncated at its hard upper bound (2.5 years); this distribution places 95% of its weight on the 2-month interval between conception and birth. The prior on tMRCA(M) and tMRCA(P) was Exponential(3.6), which places 95% of its weight on the 10 months before the sampling date, and was truncated at its hard upper bound (10 months).
Mutation count data (m1, mM and mP) had observed values of 221,370, 27 and 23 mutations, respectively (10), and were modelled using a Poisson likelihood as follows.
Phylogenetic inference: Phylogenetic tree dating Because the somatic variants employed to infer the RAxML phylogenetic tree (see 'Maximum-likelihood topology inference' above) were the result of various mutational processes, branch length estimates were independent of biological time. Correct branch length and divergence time estimates, including measures of uncertainty, were obtained using the BEAUti and BEAST (42) v1.8.4 software. The tree inferred by RAxML was used as the starting tree, and tree rearrangement operators were disabled to avoid modification of the topology. A new alignment was generated, containing all the CpG sites across the sequenced exome regions. Only C>T transitions at CpG dinucleotides ([C>T]pG or Cp[G>A]) were introduced in each tumour's sequence, using the power calculation strategy described above (see 'Maximum-likelihood topology inference' above). This alignment was imported into BEAUti (42) and two monophyletic sets of taxa were defined: one for the reference sequence, and one for the tumours. The prior for the tree height was defined as normal with a mean of 6,912 and variance of 1,410; this resulted in 95% of the distribution weight placed on the interval [4,148,9,676], consistent with the estimated time of CTVT origin (see 'Time-of-origin estimation' above). The prior for the clock rate was set to Exponential(0.5). Because only a single mutation type (C>T) is considered, a Jukes-Cantor substitution model with a Gamma model of site heterogeneity (and the default Exponential(0.5) prior for the 'alpha' parameter) was selected. To avoid considering the reference sequence as a present-day taxon (which would result in somatic variants being assigned to the branch between the root and the reference sequence), the sampling date for the reference sequence was set to -4,147 years (i.e. 4,147 years before present, coinciding with the lower end of the prior on tree height), with a precision of 0 years. This reflected an interpretation of the reference taxon as corresponding to the founder dog, as somatic variation had been called with respect to it. This also forced the length of the branch between the tree root and the reference sample to be as short as possible, while allowing the posterior distribution of the tree height to take values ≥4,147 years. The Markov chain Monte Carlo (MCMC) chain length was set to 20 million states, with parameters being logged every 1,000 states. The MCMC samples produced by BEAST (42) were assessed and summarised into a consensus tree using software packages Tracer v1.6.0 (http://beast.community/tracer) and TreeAnnotator v1.8.4 (http://beast.community/treeannotator).
Phylogenetic inference: Population dynamics inference Inference of past CTVT population dynamics was performed using a Bayesian coalescent skyline model (52) in BEAST (42) v1.8.4. The same alignment, starting tree and configuration employed in phylogenetic tree dating (see 'Phylogenetic tree dating' above) were used, except for the following details: (i) the tree prior was chosen as Coalescent: Bayesian Skyline, with 20 groups and a piecewise-constant skyline model, (ii) the MCMC chain length was set to 100 million states, with parameters being logged every 10,000 states. Tracer v1.6.0 (http://beast.community/tracer) was used to assess the MCMC samples produced by BEAST and perform a Bayesian skyline reconstruction analysis.
Definition of phylogenetic groups Monophyletic groups of tumours were defined based on the topology of the inferred tree (see 'Phylogenetic tree inference' above). Fifty-eight non-overlapping tumour groups were defined, which offered maximum phylogenetic resolution while retaining enough group-unique variants as to provide sufficient power for extraction of mutational signatures. Groups were assigned a name, code, set of samples and set of group-unique variants. The latter were defined as somatic variants present (i.e. supported by ≥3 reads) in at least one sample of the group, and not present in any samples outside the group. Three ancestral variant groups (groups A1, A2 and A3) were also defined, being composed of variants ancestral to specific subtrees (Fig. 1A); these were variants not found outside the samples within the relevant subtree of the tree, and present in enough samples as to be considered ancestral to all the branches in the subtree. Phylogenetic group information is provided in Dataset S3.
The phylogenetic groups were defined so that they provided sufficient power for reliable estimation of mutational signature exposures (see 'Mutational signature analysis' below). Power to estimate signature exposures was assessed by selecting large monophyletic tumour groups (>2,000 group-unique variants) and comparing the results of (a) fitting mutational signatures from COSMIC (https://cancer.sanger.ac.uk/cosmic/signatures) to the complete sets of group-unique variants in each group, and (b) fitting the same signatures to down-sampled group-unique variant sets, with down-sampling ratios ranging from 0.01 to 0.90. Similarity between the results was assessed using the L1 distance between the vector of signature exposures obtained from the complete variant sets ('original' exposures) and the vector obtained from the down-sampled variant sets.
The minimum number of group-unique variants needed for good overall replication of the original exposures was defined as 700 (with further group-exclusive variation improving power). Over 95% of the down-sampled sets with ≥700 variants resulted in exposures whose L1 distance to the corresponding original exposures was <0.1. For phylogenetic groups in which the variation ancestral to all tumours in the group could not be confidently assigned to the same geographical area where the tumours in the group were sampled (groups 18, 19, 37, 40, 44, 51 and 52), such ancestral variation was discarded from the set of group-unique variants. Phylogenetic groups having <700 unique variants, and which could not be merged with another group (groups 26, 39, 47 and 55), were excluded from any subsequent analysis of mutational signature exposures. The resulting phylogenetic groups harboured a median of 1,235 unique variants per group.
Some minor phylogenetic branches which had an insufficient number of groupunique variants (including branches composed largely or exclusively of tumours without a matched host), and which could not be coherently incorporated into adjacent phylogenetic groups, were not defined as phylogenetic groups and excluded from all group-level analyses (some of these samples are displayed in Fig. 1, C-D, as coloured circles without a group number).
Mutational signature analysis Mutational signatures were extracted using the sigfit (14) v1.1.0 package in R (v3.3.3) using the 61 tumour variant groups described above. This package performs Bayesian statistical estimation, via Markov chain Monte Carlo (MCMC) sampling using a Dirichlet-Multinomial statistical model that is statistically equivalent to the mathematical formulation of mutational signatures used in widely used non-negative matrix factorisation (NMF) approaches (53,54).
Specifically, 96 trinucleotide mutation categories were defined for SNVs based on substitution type (6 categories) and trinucleotide sequence context (i.e. the immediately adjacent 5' and 3' bases; 16 categories). SNVs were collapsed so that only pyrimidine reference bases (C or T) were considered (e.g. G>A and C>T represent the same change in opposite strands, so both can be expressed as C>T). A vector of mutation counts or mutation probabilities across these 96 mutation categories is known as a (trinucleotide) mutational catalogue or spectrum (5,54). In the Dirichlet-Multinomial model implemented in sigfit (14), the mutational signatures are interpreted as the probability parameters of independent multinomial distributions, and the observed mutational spectra are treated as draws from a mixture of these distributions. The signature exposures (i.e. the proportion of mutations attributed to each signature) act as mixture coefficients.
Mutational signatures were extracted from sets of somatic variants exclusive to each phylogenetic group (group-unique variants; see 'Definition of phylogenetic groups' above) using the model described above for all values of N between 2 and 10. The only resulting set of signatures that satisfactorily explained the data and closely resembled some of the signatures in the COSMIC catalogue (http://cancer.sanger.ac.uk/cosmic/signatures) was achieved for N = 3. These signatures, which were highly similar to COSMIC signatures 1, 5 and 7 (and hence were named using this nomenclature), together accurately explained the mutational spectra of all phylogenetic groups except five: the 'ancestral' groups A1, A2 and A3, and the Indian groups 57 and 58. As these five groups presented distinctive mutational spectra (Fig. S4), sharing features that were not well explained by the three extracted signatures, an additional fitting-extraction model was applied to identify any additional mutational signatures. This model, which is adapted from the one presented above, performs fitting of a given set of NF known mutational signatures (i.e. only the exposures to each signature are estimated), and simultaneously attempts to explain any outstanding components of the groups' mutational spectra by extracting NA additional signatures. In this case, the smallest value of number of additional signatures which could explain the outstanding components in the five groups' spectra was NA = 2. The resulting two signatures were named signature 2* (for its similarity with COSMIC signature 2) and signature A (for being exclusively found in the ancestral groups).
Finally, definitive signature exposures were more accurately estimated by refitting the set of extracted signatures to all phylogenetic groups. Signature fitting was performed using the Dirichlet-Multinomial model in sigfit (14) v1.1.0. Initially, all 5 signatures were fitted to the group-unique variants in each group; a second fitting was then performed, in which signatures 1, 5 and 7 were fitted to all groups, whereas signatures 2* and A were fitted only to those groups showing significant exposure to them.

Dinucleotide variant analysis
Dinucleotide variant calls were extracted from the original sets of variants called in each individual sample. Phased dinucleotide variants where both substitutions had been called together (in the same sample) were selected. The dinucleotide calls from all samples were merged into a single set and genotyped across all samples, in the same manner as for SNVs and indels (see 'Variant calling pipeline' above). Genotyped dinucleotide variants were categorised as candidate somatic and germline (host) variants (see 'Variant classification' above).
Strand-wise dinucleotide mutational spectra were obtained by categorising each variant according to base substitution type (for both bases) and transcriptional strand (transcribed or untranscribed). Dinucleotide variants were collapsed so that only one of the bases in each mutated Watson-Crick pair is considered (e.g. CA>TC and TG>GA represent the same dinucleotide change in opposite strands). Of the 16 possible dinucleotides, 12 can be reverse-complemented in this way. These collapse into 6 dinucleotides, namely AA/TT, AC/GT, AG/CT, CA/TG, CC/GG, GA/TC, resulting in 54 mutation types (6 dinucleotides, each with 9 mutation possibilities).
The remaining four dinucleotides (AT, TA, CG, GC) are identical to their reverse complements. For these four dinucleotides, there are only six mutation possibilities, as three of the possibilities are identical if reverse complemented (e.g. TA>AT reversecomplements into itself). This results in an additional 24 mutation types. Hence, the total number of unique mutation types after collapse is 78. Dinucleotide mutational spectra were obtained by counting the variants of each of type in each strand, for each of the phylogenetic tumour groups (see 'Definition of phylogenetic groups' above). The mutational spectrum of all somatic dinucleotide variants is presented in Fig. S15.

UV-latitude association analysis
The association between exposure to signature 7, number of CC>TT dinucleotide mutations and absolute average latitude of each phylogenetic group was assessed using noise-robust Bayesian statistical models for simple regression and estimation of the correlation coefficient. The studied variable pairs were (i) group-specific signature 7 exposure and number of group-unique CC>TT mutations (both of which are linked to ultraviolet light-mediated DNA damage (5,21)) and (ii) number of group-unique CC>TT mutations and absolute mean latitude of the locations where the tumours in each group were collected (Dataset S3).
For the UV-latitude association analysis, the number of group-unique CC>TT mutations was preferred over signature 7 exposure, as it constitutes a more stable indicator of the extent of UV-mediated damage. Group-unique CC>TT mutation numbers were normalised to account for differences in the size and number of group-unique variants between phylogenetic groups (see 'Definition of phylogenetic groups' above). Normalised numbers of CC>TT mutations were calculated for each group as the ratio between the number of group-unique CC>TT mutations and the number of group-unique C>T changes at CpG dinucleotides. Only phylogenetic groups in which all tumours were collected in the same country or geographic area, and hence in which the most recent node was inferred to occur within that country or geographic area, were considered in the analysis (i.e. groups 1, 36, 41, 42, 43, 45, 46, 50, 53, A2 and A3 were excluded). In cases where tumours were collected in the same region, but at different latitudes, the average latitude was used. Group A1 was used for absolute average latitude prediction (see below), but was not considered for estimation of correlation or regression.
Because the data supported a non-linear, roughly logarithmic relationship between absolute mean latitude and normalised CC>TT mutations (Fig. 2D), logarithmic regression was used to model this association. A robust Bayesian statistical model for simple logarithmic regression was implemented, which employs a Student's t-distribution to achieve robustness to outliers in the data. The values for the response variable are modelled as arising from a t-distribution with location (mean) equal to the formula for simple logarithmic regression on the explanatory variable. The degrees of freedom of the t-distribution are modelled as a parameter with its own prior distribution, conferring adaptability to non-normally distributed data. Reasonably non-informative priors were imposed on all model parameters (normal for the intercept and slope of the regression line, half-normal for the spread of the t-distribution, and gamma for the degrees of freedom). This model is similar to robust regression models described elsewhere, and is formalised below.
X is the explanatory variable, α, β are the parameters of the regression curve, σ is the spread of the t-distribution, ν is the degrees of freedom of the t-distribution. For the analysis of the association between signature 7 exposure and (unnormalised) CC>TT mutations, an analogous Bayesian model for robust linear correlation was applied, as the data supported a linear relationship between these variables (Fig. S7). This model is identical to the one described above, except for the location of the t-distribution for Y, which is (α + βX) instead of (α + β ln(X)).
Samples from the posterior distributions of the models were drawn via Markov chain Monte Carlo (MCMC) sampling, making use of the No-U-Turn sampler implemented in the Stan (51) platform through the interface provided by the rstan v2.16.2 package (http://mc-stan.org/rstan) in R (v3.3.3). The number of group-unique CC>TT dinucleotide mutations was defined as the explanatory variable (X) for both signature 7 exposure and absolute mean latitude. Regression lines with 95% HPDI (calculated from the posterior predictive distribution of the mean response), as well as a 90% prediction interval for the absolute mean latitude of group A1 (calculated from the posterior predictive distribution of Y; Fig. 2D), were obtained from the robust regression model.
Similarly, estimation of the correlation coefficient was performed using a robust Bayesian model that adopts a bivariate t-distribution. The values for the two variables are interpreted as arising from a multivariate (in this case, bivariate) t-distribution with a covariance matrix that is consistent with the definition of Pearson's correlation coefficient as the degree by which the two marginal distributions vary together. As before, the degrees of freedom are modelled as a parameter with its own prior distribution, conferring inherent adaptability to non-normally distributed data. Reasonably non-informative priors were imposed on all model parameters (normal and half-normal, respectively, for the locations and spreads of the marginal t-distributions, gamma for the degrees of freedom, and uniform for the correlation coefficient). The model is formalised below. [ where X, Y are the two observed variables, µ1, µ2 are the locations of the marginal t-distributions, Σ is the covariance matrix of the bivariate t-distribution, ν is the degrees of freedom of the bivariate t-distribution, σ1, σ2 are the spreads of the marginal t-distributions, ρ is the correlation coefficient. For the association between signature 7 exposure and CC>TT mutations (Fig. S7), a robust estimate of Pearson's correlation coefficient was obtained via MCMC sampling from this model, as described above for the regression models. For the non-linear association between absolute mean latitude and normalised CC>TT mutations (Fig. 2D), a robust estimate of Spearman's correlation coefficient, which does not assume linearity, was obtained as the correlation between the sample ranks of the values for both variables; these were obtained via the 'rank' function in R.
Recurrent mutation analysis: Discovery of candidate recurrent mutation events Candidate recurrent mutation events were discovered using a bespoke computational pipeline (mutree v2.7182), which builds on the software RAxML (41) v8.2.9 and treesub v0.2 (https://github.com/tamuri/treesub). An input sequence alignment (for phylogenetic inference) was constructed by concatenating the coding sequences (CDS) of all the coding genes in the CanFam3.1 reference genome, such that the entire sequence remained in frame. To select a set of high-confidence samples and variants, the following minimum thresholds were defined: tumour purity ≥0.49 (see 'Tumour purity estimation' above); tumour 'QC-index' ≥0.20 (defined as the product of purity and fraction of sequenced bases with depth >30×; Dataset S2); read depth ≥30× at the variant locus across all considered samples; and ≥4 reads supporting variant presence. The threshold of ≥4 supporting reads is three reads below the expected number of supporting reads for a heterozygous variant, given by minimum tumour purity × minimum read depth / 2 = 0.49 × 30 / 2 ≈ 7. In addition, any variant with exactly 3 supporting reads in any tumour was discarded; variants with <3 reads in a tumour were allowed in order to avoid sequencing noise from causing real variants to be excluded.
These selection criteria produced a high-confidence set of 420 tumours and 41,335 coding somatic variants. These were imposed onto the aligned sequences via the same procedure used prior to inference of the phylogenetic tree (see 'Phylogenetic tree inference' above). The steps of the recurrent mutation discovery pipeline are described below.
1. Input processing. Two additional versions of the alignment were produced: one by removing those sites containing undetermined bases ('N') in all the sequences; and another by concatenating only those codons containing variable sites.
2. Maximum likelihood tree inference. A maximum likelihood (ML) phylogenetic tree was inferred from the input alignment using RAxML. The following options were specified: '-f a' (rapid bootstrap and search for the best-scoring ML tree), '-m GTRGAMMA' (GTR substitution model with Gamma model of rate heterogeneity), '-# autoMRE' (extended majority rule consensus tree stop criterion for bootstrap), '-x 931078' (random number seed for bootstrap), and '-p 272730' (random number seed). The resulting ML tree was subsequently rooted on the reference sequence.
3. Ancestral sequence reconstruction. Marginal reconstruction of ancestral sequences in the inferred tree was performed using RAxML. The modified alignment composed only of variable codons (see above) was used to decrease computational cost.
Options '-f A' (calculation of marginal ancestral states) and '-m GTRGAMMA' (see above) were specified.
4. Tree annotation and recurrent mutation identification. Treesub was used to annotate the mutations occurring in each branch of the tree, based on the reconstructed ancestral sequences. Mutations were assessed for the corresponding amino acid change. A table of gene coordinates was used to map each mutation to a gene, and any group of two or more nonsynonymous changes affecting the same gene were marked as potentially recurrent.
Recurrent mutation analysis: Filtering and validation of identical recurrent mutations A potentially recurrent mutation was defined as either (a) a nonsynonymous mutation in a gene harbouring at least one other nonsynonymous change, or (b) any coding mutation with two or more seemingly independent occurrences in the tumour tree (i.e. reported by mutree more than once); the latter are referred to as 'identical recurrent' mutations.
Initial filtering was applied to the set of candidate recurrent mutations identified by mutree (see 'Discovery of candidate recurrent mutation events' above) to remove backward and nested mutations. A candidate recurrent mutation was classified as 'forward' if the corresponding SNV call was found in the same samples as the candidate mutation. If the SNV was found in the samples without the candidate recurrent mutation, this was considered a 'backward' mutation. A 'nested' mutation was defined as a forward mutation occurring in a subtree which hangs from a node with the complementary backward mutation; these are caused by errors in marginal ancestral state reconstruction.
Consequently, the following sets of candidate recurrent mutations were discarded: (i) any backward mutation, as well as any complementary forward mutations nested within it; (ii) any mutation with median VAF <0.05 across samples with ≥4 supporting reads; (iii) any mutation found alone in a gene after applying the two previous filters.
After the initial filtering described above, further filtering and validation was performed on the set of candidate identical recurrent mutations. The following sets of candidate mutations were discarded: (i) any group of two or more mutations supported by ≥3 reads in the exact same set of tumours; (ii) any mutation supported by ≥3 reads in ≥1 host sample not matched to a tumour presenting the mutation (hosts with >1,000 low-VAF somatic variants were excluded); (iii) any mutation supported by ≥5 reads in ≥1 host matched to a tumour presenting the mutation (hosts with >1,000 low-VAF somatic variants were excluded); (iv) any mutation found only in the following set of tumours (or a subset of them), which showed signs of moderate tumour-to-tumour DNA contamination: 12T, 13T, 1322T, 25T; (v) variants in olfactory receptor (OR) genes, genes predicted to be members of OR gene families, or genes belonging to the VPS13D or PKD1L1 gene families, all of which were prone to artefactual variants caused by read misalignment.
The number of identical recurrent variants after filtering was 412. These were subsequently validated in silico via visual inspection in JBrowse (46) v1.11.6 and categorised into the following groups: (i) real recurrent mutations present in ≥2 independent tree branches (99 mutations); (ii) candidate recurrent mutations present in one tree branch and ≥1 singleton sample (141 mutations); (iii) candidate recurrent mutations present only in singleton samples (55 mutations); (iv) false positive or lowconfidence mutations attributed to diverse causes (117 mutations). The resulting set of 295 filtered and validated identical recurrent mutations is provided in Dataset S6.
Recurrent mutation analysis: Estimation of the expected distribution of identical recurrent mutations Probabilistic and simulation analyses were conducted with the aim of comparing the observed number of recurrent mutations of each type (see above) with the number that would be expected under neutrality. The factors influencing the distribution of mutation types under neutrality were assumed to be (i) the mutation probabilities per mutation type (reflected in the observed mutation counts), and (ii) the mutational opportunities per mutation type. The influence of selection and factors that can affect mutation rate, such as expression, DNA structures or chromatin state, were thus disregarded.
Mutational opportunities were obtained as the frequencies (across the exome) of the 32 trinucleotide types associated with the 96 mutation types (see 'Mutational signature analysis' above). As for the mutation probabilities, because recurrent mutations can only be identified post-divergence (i.e. after the basal trunk and first basal node of the phylogenetic tree), the mutational spectrum of the entire post-divergence somatic variant set (i.e. variants outside group A1; 125,398 SNVs) was used to obtain the mutation counts per mutation type.
The determination of the probability of at least one recurrent event in a set of NC mutations of type C (with C = 1, …, 96), with corresponding mutational opportunity OC, corresponds to a generalisation of the popular 'birthday paradox' as a collision problem (55). (The birthday paradox concerns the probability that, in a set of N randomly chosen people, at least two will have the same birthday.) Such collision-problem formulation implies that, given NC random integers drawn from a discrete uniform distribution with range [1, OC] (where NC ≤ OC), the probability p(NC, OC) that at least two numbers are the same is The probability of observing at least one recurrent mutation was thus calculated using the mutation counts (NC) and opportunities (OC) as described above. This probability was found to be below 0.5 for all mutation types excepting C>T transversions, for which probabilities were all close or equal to unity (Fig. S16A).
In addition to recurrence probabilities, the expected number of recurrent mutations of each type, given the observed mutation counts and opportunities, was estimated through two different approaches. The first of these infers the expected recurrent mutation counts from the collision problem as follows. If the probability that the k-th integer (mutation) randomly chosen from [1,OC] will repeat at least one previous choice is then, as NC such integers are randomly chosen from [1,OC], the expected total number of times a selection will repeat a previous selection equals It should be noted that this formula does not count the total expected number of unique recurrent mutations. Rather, it counts the number of mutations which are identical to an already existing mutation at the time when they occur, and thus ignores the first occurrence of each recurrent mutation. (For the derivation of the formula, see http://matt.might.net/articles/counting-hash-collisions.) The second approach followed was to simulate 100,000 random draws of 125,398 mutations each, taking into account the observed mutation counts and opportunities. The median numbers of recurrent mutations observed in these simulated sets are virtually identical to those obtained via the formulation presented above (Fig. S16, B and C). Contrasting these estimates to the actual numbers of observed identical recurrent mutations of each type (ignoring first occurrences) revealed that, although the total number of (filtered and validated) observed recurrent mutations is slightly less than expected, the observed distribution presents more N[C>T]G mutations than expected, and less of most other C>T types (Fig. S16D). This suggests that some CpG sites have a higher mutability than predicted, probably reflecting CTVT methylation status. Interestingly, there are 8 cases of recurrent non-C>T mutations, which are not expected to occur under the conditions assumed in the described approach (Dataset S6).
Faceted analysis of somatic mutation burden Somatic mutation burden was analysed across categories defined on the basis of mutation type, genomic region, phylogenetic period, gene expression and transcriptional strand. Somatic variants were first decomposed into (i) exonic or intronic, on the basis of their overlap with the CanFam3.1 gene and exon coordinates, which were retrieved via the biomaRt (61) v2.30.0 package in R (v3.3.3), considering all coding genes and exons; (ii) pre-or post-divergence, on the basis of their belonging to the set of basal trunk ancestral somatic variants (group A1; Fig. 1A); (iii) five gene expression quintile categories and ten gene expression quintile categories, according to the relationship between the average transcript abundance of the affected gene and the transcript abundance quintiles and deciles (see 'Gene expression analysis' above); and (iv) mutation type categories: 96 trinucleotide mutation categories for SNVs (see 'Mutational signature analysis' above) and two categories (insertions and deletions) for indels.
Mean SNV and indel burdens (in mutations per kilobase or megabase) were calculated for each intersection of categories, by dividing the total number of variant counts in each category intersection by the total length of considered sequences in the same category intersection (an intersection of categories could be, e.g., 'pre-divergence intronic deletions in genes within the lowest expression quintile category'). Confidence intervals on the mean burdens were calculated using the standard formula.
For some SNV types, which were identified as major features in the spectra of previously inferred mutational signatures ( Fig. 2; see main text and 'Mutational signature analysis' above), somatic burdens were calculated separately for each transcriptional strand (transcribed and untranscribed) across the ten gene expression decile categories. These SNV types were (i) C>T changes at NCG trinucleotides (N[C>T]G changes), associated mainly with signature 1; (ii) C[C>T]C changes, associated mainly with signature 7; and (iii) T[C>T]C changes, associated mainly with signatures 7 and A. In addition, mutational signatures were refitted to the sets of pre-and post-divergence variants in each gene expression quintile category, using the Bayesian model described above.
Characterisation of hemizygous essential splice-site mutation in PTEN During assessment of annotated missense mutations occurring in the basal trunk of the CTVT phylogenetic tree (group A1, Fig. 1A; see 'Definition of phylogenetic groups' above), and located in haploid regions (indicated by both copy number estimates having a value of 1; see 'Copy number assignment' above), a misannotated essential splice-site variant was identified at position -1 relative to the start of exon 2 in gene PTEN (G>A variant at position 26:37,853,147). This variant had been misannotated as missense due to an error in the annotation of PTEN in the canine reference genome (CanFam3.1). Multiple alignment of this region of the dog genome against the horse, cat, mouse and human reference genomes in Ensembl (http://www.ensembl.org/Canis_familiaris) revealed an assembly gap in CanFam3.1 which spans the beginning of PTEN. As a result, the first exon of PTEN is absent from the genome, and the second exon is instead misannotated as the first one. In addition, the start of this misannotated exon (which in CanFam3.1 also corresponds to the start of the gene) is annotated slightly upstream of where it should. Specifically, the exon start is annotated at position 26:37,853,135, while the multi-species alignment indicated that it should start at 26:37,853,148, such that the mutated guanine base at 26:37,853,147 (immediately upstream of the exon start) is an essential splice site in all the examined annotations except CanFam3.1.
The VAF of the identified variant splice-site variant across the set of CTVT tumours was found to be consistent with hemizygosity (median VAF 0.427, interquartile range 0.216), and variation in VAF between tumours was found to be explained by differences in tumour purity (VAF-purity correlation coefficient 0.877; see 'Tumour purity estimation' above). As this hemizygous essential splice-site mutation may be an early driver of CTVT, its effect on the function of PTEN was further assessed through inspection of RNA sequencing (RNA-seq) data (see 'RNA sequencing' above). The alignment of RNA-seq reads to the start of PTEN exon 2 was examined in order to identify the actual starting sequence of this exon. By searching for the obtained starting sequence across the RNA-seq reads, a moderate number of reads featuring the ending sequence of PTEN exon 1 were retrieved; these likely corresponded to correctly spliced transcripts present in contaminating host dog cells. The identified sequence was found to be almost identical to the ending of human PTEN exon 1. Subsequently, this ending sequence of exon 1 was searched for across the RNA-seq data, in order to determine how this exon is spliced in CTVT transcripts. Three different cases of alternative splicing between exons 1 and 2 were found, all of which were out of frame, thus resulting in protein truncation; no spliced sequences involving other exons were identified. This was taken as evidence that the G>A variant at 26:37,853,147 is indeed an essential splice-site mutation, leading to protein truncation during translation of PTEN.
Interestingly, these three alternative splicing cases involved the end of exon 1 and positions slightly downstream of the start of exon 2, all of which were immediately preceded by an 'AT' dinucleotide. This suggests that, in the absence (due to the described G>A mutation) of the essential 'AG' sequence that is normally found immediately upstream of the exon start, the splicing machinery in CTVT recognises different downstream 'AT' dinucleotides as splicing sites.
Selection analysis: dN/dS model Evidence for selection was assessed by means of the widely applied dN/dS ratio (62,63). Global and gene-wise dN/dS estimates were obtained for the somatic SNVs found in different sets of genes using the dNdScv (17) v0.0.0.9 package in R (v3.3.3). Recurrent somatic SNVs (see 'Recurrent mutation analysis' above) were included in the somatic set as many times as their respective number of independent occurrences. Pairs of SNVs that were likely to correspond to a dinucleotide variant were removed. A database including the CDS sequences of coding genes in the CanFam3.1 reference genome was supplied to dNdScv in order to determine variant consequences.
Estimation of global and gene-wise dN/dS ratios was performed via the 'dndscv' function in the dNdScv package. This function uses a set of covariates to improve estimation of the background mutation rate for each gene (17). A matrix of covariates was constructed by remapping the reference covariates provided in the dNdScv package, 20 which consist of the first 20 principal components of 169 chromatin marks from the RoadMap Epigenomics Project (17,64).
As these covariates relate to the reference human genome, each gene in the covariate matrix was mapped to its orthologue in the CanFam3.1 reference genome, using the biomaRt (61) v2.30.0 package in R. In addition, available estimates of copy number and average gene expression per gene (see 'Copy number assignment' and 'Gene expression analysis' above) were added as additional covariates. The following options were specified in the 'dndscv' function, in order to avoid truncation of the sets of mutations and covariates: 'max_muts_per_gene_per_sample = Inf', 'max_coding_muts_per_sample = Inf', 'maxcovs = Inf'. CTVT being a clonal lineage, all SNVs were assigned to a single virtual 'sample'.
Selection analysis: Gene sets After obtaining global and gene-wise dN/dS estimates for the entire somatic variant set, the 'dndscv' function was run on specific sets of genes (using the 'gene_list' argument). Different sets of genes were defined for stratified analysis of global dN/dS: (i) five gene expression quintile groups, based on estimated average gene expression (see 'Gene expression analysis' above); (ii) three copy number groups, based on gene-wise copy number estimates (copy number states of 1, 2 and ≥3; see 'Copy number assignment' above); and (iii) two groups of reportedly essential genes (see below), according to whether their estimated copy number was 1 or >1. These gene groups, together with their respective global dN/dS estimates, are provided in Dataset S5.
A list of 2,028 essential genes was compiled by combining lists of essential genes reported in the literature (65)(66)(67)(68) and in the OGEE database (http://ogee.medgenius.info). For the former, the original authors' criteria were followed when selecting essential genes. For genes listed in the OGEE database, genes marked as 'essential' in any of the 18 available human datasets were selected. The following genes were subsequently removed from the compiled list: (i) genes marked as tumour suppressor ('TSG') or recessive ('Rec') in the COSMIC Cancer Gene Census (69) (http://cancer.sanger.ac.uk/census); (ii) genes from gene families observed to harbour recurrent alignment artefacts (genes OR51T1, VPS13D, VPS16, VPS18, VPS26A, VPS28, VPS29, VPS35, VPS36, VPS45, VPS52, VPS53, VPS54, VPS72); (iii) gene AGO2, which is commonly used as an RNAi effector in essentiality screens; and (iv) genes that were not somatically mutated in any CTVT tumour. Strong evidence of negative selection was found in this gene set ( Fig. 3 and Dataset S5).
In addition to the groups described above, 22 groups of genes were defined based on gene ontology (GO) annotations, obtained via the biomaRt (61) v2.30.0 package in R. To enhance statistical power and reduce multiple testing correction, groups were defined by selecting genes with GO terms including 22 selected keywords, which were chosen with a view to covering the set of hallmark features and processes of cancer (70). These keywords were defined via regular expressions as follows: 'repair', 'damage', 'transcription', 'translation|ribosom', 'mitochondri', 'proteolysis', 'metaboli', 'exosom', 'immun', 'adhesion', 'chromatin', 'inflammat', 'vesicle', 'proliferati', 'apopto', 'angiogenesis', 'chromosom', 'folding|folded', 'cytoskeleton', 'cell migration', 'cell cycle', 'phagosome|phagy'. Annotations from all three GO domains ('Biological Process', 'Cellular Component' and 'Molecular Function') were considered. After dN/dS estimation in each group, Bonferroni multiple-testing correction was applied to account for overlaps between the GO gene groups. The only statistically significant signal of negative selection after correction was observed for nonsense mutations in the group of genes with GO terms including the keyword 'repair' (the large majority of which are essential genes involved in DNA repair). Statistically significant evidence of positive selection was not found in any group.
Selection analysis: Simulated mutation sets Simulated sets of somatic mutations under neutrality conditions were generated by reproducing the number (148,031) and mutational spectrum (see 'Mutational signature analysis' above) of observed somatic SNVs. The variants in each of the 96 trinucleotide mutation types were assigned random positions (having the relevant trinucleotide context) along the genomic regions defined in the dNdScv reference CDS database for CanFam3.1 (see 'dN/dS model' above). Estimated dN/dS ratios in 1,000 such neutral simulations showed a distribution of values closely centred on dN/dS ≈ 1 (absence of selection) for both missense and nonsense mutations, thus confirming absence of a bias in the dNdScv regression model. Furthermore, adding different proportions of randomly selected germline SNPs to the simulated mutation sets resulted in a proportional downward shift in the distribution of estimated global dN/dS, whereas filtering of simulated mutations that collided with germline SNPs in any host resulted in an upward shift in the global dN/dS distribution, further supporting the validity of the model (Fig.  S11).

Table S3
Identified candidate early driver mutations in CTVT. Description and genomic coordinates of the candidate early driver mutations identified in this study. The candidate driver events affecting genes CDKN2A, MYC and SETD2 have been previously described (12,27). For details on the characterisation of the PTEN mutation, see 'Characterisation of hemizygous essential splice-site mutation in PTEN' in the Materials and Methods.