A gene pathogenicity tool ‘GenePy’ identifies missed biallelic diagnoses in the 100,000 Genomes Project

The 100,000 Genomes Project (100KGP) diagnosed a quarter of recruited affected participants, but 26% of diagnoses were in genes not on the chosen gene panel(s); with many being de novo variants of high impact. However, assessing biallelic variants without a gene panel is challenging, due to the number of variants requiring scrutiny. We sought to identify potential missed biallelic diagnoses independent of the gene panel applied using GenePy - a whole gene pathogenicity metric. GenePy scores all variants called in a given individual, incorporating allele frequency, zygosity, and a user-defined deleterious metric (CADD v1.6 applied herein). GenePy then combines all variant scores for individual genes, generating an aggregate score per gene, per participant. We calculated GenePy scores for 2862 recessive disease genes in 78,216 individuals in 100KGP. For each gene, we ranked participant GenePy scores for that gene, and scrutinised affected individuals without a diagnosis whose scores ranked amongst the top-5 for each gene. We assessed these participants’ phenotypes for overlap with the disease gene associated phenotype for which they were highly ranked. Where phenotypes overlapped, we extracted rare variants in the gene of interest and applied phase, ClinVar and ACMG classification looking for putative causal biallelic variants. 3184 affected individuals without a molecular diagnosis had a top-5 ranked GenePy gene score and 682/3184 (21%) had phenotypes overlapping with one of the top-ranking genes. After removing 13 withdrawn participants, in 122/669 (18%) of the phenotype-matched cases, we identified a putative missed diagnosis in a top-ranked gene supported by phasing, ClinVar and ACMG classification. A further 334/669 (50%) of cases have a possible missed diagnosis but require functional validation. Applying GenePy at scale has identified potential diagnoses for 456/3183 (14%) of undiagnosed participants who had a top-5 ranked GenePy score in a recessive disease gene, whilst adding only 1.2 additional variants (per individual) for assessment.


Abstract
The 100,000 Genomes Project (100KGP) diagnosed a quarter of recruited affected participants, but 26% of diagnoses were in genes not on the chosen gene panel(s); with many being de novo variants of high impact. However, assessing biallelic variants without a gene panel is challenging, due to the number of variants requiring scrutiny. We sought to identify potential missed biallelic diagnoses independent of the gene panel applied using GenePy -a whole gene pathogenicity metric.
GenePy scores all variants called in a given individual, incorporating allele frequency, zygosity, and a user-defined deleterious metric (CADD v1.6 applied herein). GenePy then combines all variant scores for individual genes, generating an aggregate score per gene, per participant. We calculated GenePy scores for 2862 recessive disease genes in 78,216 individuals in 100KGP. For each gene, we ranked participant GenePy scores for that gene, and scrutinised affected individuals without a diagnosis whose scores ranked amongst the top-5 for each gene. We assessed these participants' phenotypes for overlap with the disease gene associated phenotype for which they were highly ranked. Where phenotypes overlapped, we extracted rare variants in the gene of interest and applied phase, ClinVar and ACMG classification looking for putative causal biallelic variants. 3184 affected individuals without a molecular diagnosis had a top-5 ranked GenePy gene score and 682/3184 (21%) had phenotypes overlapping with one of the top-ranking genes. After removing 13 withdrawn participants, in 122/669 (18%) of the phenotype-matched cases, we . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Introduction
The 100,000 Genomes Project was a UK government funded research project led by Genomics England (GEL) to sequence 100,000 whole genomes for families predominantly presenting with rare disease. 1 The project utilised a phenotype to genotype approach, whereby genome sequencing data were filtered using a pre-selected PanelApp 2 gene panel or panels chosen by Genomics England based on the Human Phenotype Ontology (HPO) 3 terms recorded at recruitment. 1; 4 The project was completed in 2020 and yielded an overall diagnostic rate of ~25% across all rare disease categories. 1; 5 However, as ever-increasing numbers of researchers gained access to anonymised whole genome sequencing data from the 100,000 Genomes Project, additional diagnoses were made using methods that extended variant analysis beyond gene panels across more coding and non-coding regions, which have subsequently been returned to participants. 4 As of 2022, 26% of all diagnoses returned by the 100,000 Genomes Project were from diagnoses not on the pre-selected gene panel applied, with many being pathogenic de novo coding variants. 4; 5 However, assessing other variants such as biallelic variants is more burdensome, particularly without the use of gene panels due to the sheer . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 30, 2023. ; number of variants that require scrutiny. This is because many are inherited from unaffected relatives and are carried at non-trivial allele frequencies in population databases. Furthermore, biallelic variation is often hard to interpret especially for compound heterozygotes where one variant may be pathogenic, and another may be a copy number variant, non-coding variant, or other variant of uncertain significance. This is where gene panels show their greatest utility since they can help narrow down variants to clinically relevant genes. 2 However, this approach must be balanced against the potential of missing diagnoses outside of the original gene panel applied.
We sought to identify potential missed biallelic diagnoses in recessive disease genes independently of the gene panel applied using a whole genome pathogenicity metric called GenePy, pronounced "Jenni-pea (d͡ ʒˈɛnɪpˌiː)". GenePy (https://github.com/UoS-HGIG/GenePy-1.3) is a gene pathogenicity prioritization tool developed at the University of Southampton that transforms the interpretation of next generation sequencing data from the variant level to the gene or pathway level. 6 GenePy incorporates allele frequency, individual zygosity (where a heterozygote scores one point and a homozygote scores two points), and a user-defined deleterious metric (such as CADD 7 ) into a single variant score.
GenePy is defined as: [Where h = individual; g = gene; k = variants; i = locus; Di = allele deleteriousness; fi = allele frequency; fi1 = allele 1; fi1 = allele 2] . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 30, 2023. ; GenePy then aggregates variant scores across genes in an additive manner, generating a single score, per gene, per individual that is represented in a GenePy matrix table (Figure 1). However, for large genes and intronic regions there is a potential to accumulate noise from low scoring variants. To mitigate this, GenePy can be customised to filter variants with high in silico scores only e.g. CADD score above a particular threshold. Additionally, GenePy can be applied across any defined interval and variant scores do not have to be summed across genes, e.g. one may choose to sum variants across a particular biological pathway or genomic region. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 30, 2023. ; Upon generation of a GenePy matrix, GenePy scores can be compared across individuals in a cohort; GenePy scores are intuitive in that higher GenePy scores correlate with higher pathogenic variant burden such that individuals can be ranked for their score for any given gene, relative to all individuals with comparable input genomic data. GenePy scores are not easily compared between genes, without normalisation and adjustment for gene length. Even then, genes with alternative tolerance to dysfunctional variation are likely to exhibit very different GenePy score profiles. Instead, GenePy demonstrates the greatest utility when individual gene scores are compared across large numbers of individuals. Since GenePy is an additive score, individuals in large cohorts with the highest ranked GenePy scores will be enriched for biallelic disease. Given the potential for missed biallelic diagnoses in the 100,000 Genomes Project, we applied GenePy at scale in a panel-agnostic way to uplift diagnostic rates.

Access to 100,000 Genomes Project Data
Participants were recruited to the 100,000 Genomes Project with written consent. The full protocol is available here: https://doi.org/10.6084/m9.figshare.4530893.v7. Deidentified data from the project held are in the secure Genomics England Research Environment (RE).
We obtained access to 100,000 Genomes Project data following governance training and through membership of the 'Quantative Methods, Machine Learning, and Functional Genomics' Genomics England Clinical Interpretation Partnership. We had an approved Genomics England Project (RR359).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 30, 2023. ; https://doi.org/10.1101/2023.03.21.23287545 doi: medRxiv preprint In 2022, we accessed 78,216 whole genomes from affected and unaffected participants recruited to the 100,000 Genomes Project. We extracted participants' affection status and any HPO terms associated with participants' records. Using the package LabKey in R, we queried the 'GMC Exit Questionnaire' SQL table and extracted any diagnostic (likely pathogenic/pathogenic) variants returned to participants by the project.

Curating a list of recessive disease genes
To target our method towards potential missed biallelic diagnoses, we curated a list of 2862 recessive disease genes using the OMIM 8 database (downloaded in May 2022) and cross checked these findings with the GenCC database, whereby discrepancies in inheritance were examined more carefully. 9 We then generated a bed file of gene coordinates for GRCh38 using the UCSC Genome Browser. The full gene list is available in Supplementary A.

Application of GenePy
Within the Genomics England RE we applied GenePy v.1.3 (https://github.com/UoS-HGIG/GenePy-1.3) software to 78,216 participants in the 100,000 Genomes Project using CADD 7 v1.6 as our deleterious metric and the gnomAD v.2.1.1 and V3 10 databases as our reference for allele frequency. We selected variants with a minimum depth of 10, minimum GQ of 20, and mean GQ > 35 using vcftools. We applied a call-rate filter, whereby each variant had to be genotyped in at least 70% of the cohort. For downstream analysis, we only modelled and scored participant variants annotated as coding +/-8 base pairs (on any transcript) and with a CADD score ≥15. We specified CADD as our input metric because it scores the greatest variety . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 30, 2023. ; https://doi.org/10.1101/2023.03.21.23287545 doi: medRxiv preprint and number of variant types. We generated GenePy scores for 2862 recessive disease genes to create a matrix comprising GenePy scores for 2862 genes across 78,216 individuals. Of note, in addition to 'affected' participants, this cohort included many 'control' type individuals that represented unaffected parents of affected children and germline genomes of cancer patients.
For each of the 2862 recessive genes, we ranked every Genomic England participant's GenePy score relative to one another e.g. the person with the highest GenePy score for CFTR would be ranked 1, and the person with the lowest GenePy score in CFTR would be ranked 78,216. After ranking, we arbitrarily assessed only individuals who ranked amongst the top-5 GenePy score for each gene. If two individual had identical scores, we retained all participants with a rank of 5 or less. We then removed any individuals who were coded as unaffected and affected individuals with insufficient phenotypic data in the form of HPO terms recorded. We next separated affected cases into those with a confirmed diagnosis returned by the 100,000 Genomes Project and those with a negative result. If the participant had a diagnosis returned, we assessed whether the established diagnostic variant was in a top-5 ranked GenePy score (Figure 2).
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) For affected participants with a negative genome result, we extracted HPO terms from R Labkey and compared these HPO terms with the clinical features associated with the disease gene for which they scored in the top 5 rank. For example, if the participant had the HPO terms 'pancreatic insufficiency', 'failure to thrive' and 'recurrent chest infections' and they ranked third/3 rd for CFTR, we would compare their HPO terms with the clinical features of cystic fibrosis. This process was completed manually by a clinician who used clinical acumen, phenotypic descriptions and HPO terms listed in OMIM, and the clinical literature to help assess phenotype overlap. If the participant's HPO terms were consistent with those for a gene that the same participant was ranked in the top 5 GenePy scores for (e.g. the participant had pancreatic insufficiency and recurrent chest infections and was ranked 3 in CFTR), this was considered a potential missed diagnosis. If the disease-gene phenotype was unrelated to the participant's clinical phenotype but represented a gene in the ACMG 78 11 list or may represent an adult onset disease, this was considered a potential incidental finding. For these, we contacted the recruiting clinician to discuss the findings. If there was no correlation between the participant's HPO terms and the clinical phenotype for the implicated disease gene, this was considered to be lacking phenotypic overlap and excluded from further consideration.

Assessing potential missed diagnoses
When the participant's phenotype was overlapping with the disease gene for which the participant ranked in the top 5, we extracted all variants from the participant's variant call file with a CADD score ≥ 15. These variants were then prioritised by likelihood of being a missed biallelic diagnosis, taking into consideration variant phase where possible, ClinVar 12 status, and . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 30, 2023. ; https://doi.org/10.1101/2023.03.21.23287545 doi: medRxiv preprint variant curation by ACMG/AMP 13 guidelines (Figure 2). Variants that were prioritised as 'Top' priority were considered putative missed diagnoses and mostly represented homozygous likely pathogenic/pathogenic variants or likely pathogenic/pathogenic compound heterozygous variants.

Results
We applied GenePy to 2862 recessive disease genes in 78,216 cases recruited to the 100,000 Genomes Project. A summary of results is provided in Figure 3. For each gene we selected the In total, there were 3184 affected individuals who had a 'no diagnosis' genome report returned by the 100,000 Genomes Project who were ranked in the top 5 GenePy scores for the 2862 computed recessive disease genes. For these cases, we compared the participant's reported . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. For the 682 participants with a potential missed diagnosis, we extracted variants in their top 5ranked gene with a CADD score ≥ 15 directly from their variant call file. In total we extracted 847 unique variants. Following prioritisation (Figure 2), we identified 122 top priority, putative missed diagnoses supported by phase, ClinVar 12 classifications and ACMG/AMP guidelines. 13 262 individuals were assigned 'Middle' priority demonstrating supportive evidence for a potential missed diagnosis, whereby for many there was lack of phased data limiting diagnostic potential. 72 individuals had some, but weak evidence for a potential missed diagnosis for example due to one variant being non-coding on the matched annotation from ECBI and EMBL-EBI (MANE) 14 transcript and were assigned 'Low' priority. 229 cases were ruled as nondiagnostic, typically due to the variants being in cis, being non-coding on the MANE transcript, not segregating with affected and related individuals, and being common in the 100,000 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 30, 2023. ; https://doi.org/10.1101/2023.03.21.23287545 doi: medRxiv preprint Genomes call-set (Table 1). In 13 cases, no variants were extracted because the individual had withdrawn from the 100,000 Genomes Project.

Results of GenePy applied to 2862 autosomal recessive disease genes in 78,216 individuals.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 30, 2023. ;

Discussion
We applied a gene pathogenicity score, GenePy, to a cohort of 78,216 individuals recruited to the 100,000 Genomes Project. Utilising ranked individuals' GenePy scores for 2862 recessive disease genes, we identified outliers with the highest GenePy scores per gene. We selected individuals who ranked in the top 5 scores for each gene, with an expectation that these individuals may harbour missed biallelic diagnoses.
847 individuals with a top 5 ranked GenePy score had a diagnosis returned by the 100,000 Genomes Project. 599/847 (71%) of these individuals had a diagnosis in a top 5 ranked gene, demonstrating how GenePy was able to rapidly recover 71% of diagnoses, showing potential diagnostic utility for both known and novel disease genes. The remaining 248 cases had diagnoses in dominant genes, with 81 diagnoses being de novo and 161 being inherited from an affected individual or the individual represented a singleton.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. interest in a disease gene consistent with the participant's phenotype with some supportive evidence for pathogenicity, but often phase could not be determined due to missing parental data. Additionally, for many of these cases, the variants contributing to the high GenePy scores were classified as VUSs and therefore require additional functional work-up. These variants are being reviewed by a clinical scientist in an NHS accredited diagnostic laboratory. Whilst followup of these variants is outside the scope of this research project, many of these variants, even those prioritised in the low category, may represent pathogenic variants. For example, noncoding variants were assigned to a lower priority grouping, despite them having a CADD score ≥ 15. It is hoped that many of these variants may be functionally investigated in the future as high-throughput methods to model VUSs advance.
In total, GenePy has identified potential missed diagnoses in 456/2864 (16%) of undiagnosed individuals who had a top-5 ranked GenePy score in a recessive disease gene. On average this resulted in the curation of 1.2 additional variants per participant. Therefore, the application of . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) GenePy successfully uplifted diagnosis rates without adding large variant numbers requiring time-consuming manual curation for diagnostic laboratories to assess and classify.
GenePy 6 is an open-source transferrable piece of software that can be successfully applied at scale. GenePy matrices can be used as reference datasets for other cohorts applying the same GenePy methods i.e. when applying the same deleterious metric, population reference database and quality control thresholds. For example, GenePy may be applied to a cohort of 10 samples, whereby these 10 individuals' GenePy scores could be ranked against a larger GenePy matrix comprising 100,000 individuals. However, GenePy matrices for genome sequencing data should only be compared with other genome sequencing datasets, unless restricted to the same target regions of exome data.

Limitations and opportunities
The application of GenePy to the 100,000 Genomes Project is not without its limitations. For one, we used an entirely arbitrary cut off of 5 when we ranked individuals. It is entirely possible that a more permissive value may capture a wider range of diagnoses; however, this must be balanced with the additional number of variants, per individual, that would require further scrutiny by clinical laboratories.
We assessed for phenotype overlap between the participants' HPO terms and the clinical features described for the disease-gene in which the participants ranked in the top 5 GenePy scores. For 320 cases, the HPO terms were so limited that it was not possible to assess overlap.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 30, 2023. ; This represents a real-world limitation of sequencing studies whereby there is often variability in how submitters record phenotype data and highlights the importance of accurate phenotyping. This phenotype comparison step was performed manually on 2864 cases.
Application of automated methods to compare participant HPO terms with disease gene phenotypes may prove more time efficient for GenePy applied at scale, however it is unlikely that clinical or diagnostic laboratories applying GenePy would be reviewing thousands of individuals at once, but rather on a case-by-case basis. Additionally, automated methods lack the clinical knowledge and experience of a clinician or clinical scientist that may be better able to intelligently compare groups of similar phenotypes.
In our application of GenePy we used CADD v.1.6 to capture and model the greatest breadth of variation in an unbiased way, but it may be that incorporation of other metrics for different variant types (e.g. REVEL 15 for missense) may prove more sophisticated in an improved model. However, this is likely to require machine learning to apportion in silico weightings fairly for different variant types. We also applied a CADD cut off of ≥ 15 to avoid individuals accruing high GenePy scores in genes of increasing length, where there was a higher chance of finding multiple ultrarare variants by pure chance that would score highly in GenePy. Whilst we are confident that using CADD ≥ 15 reduced a lot of noise and helped isolate pathogenic variants, we accept that this approach risks missing some pathogenic variants with lower CADD scores.
GenePy currently does not utilise phased data, meaning that some high scores may represent variants inherited in cis; indeed, we observed this in 71 cases (Table 1). However, we were . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 30, 2023. ; https://doi.org/10.1101/2023.03.21.23287545 doi: medRxiv preprint conscious not to limit GenePy to nuclear families with parental data since this does not represent a real-world example and would disadvantage non-parent/child families, where phase cannot be determined. In the future, this could perhaps be mitigated with long read sequencing data.
Whilst we applied GenePy herein focusing on identification of potential missed recessive disease, there may also be opportunities to apply it in autosomal dominant diseases. When we scrutinised the variants of individuals with potential missed diagnoses, we identified 61 individuals that ranked in the top 5 GenePy scores for a given gene, yet they only had one variant with a CADD score ≥ 15 in that gene. Most commonly these individuals harboured predicted loss-of-function variants which are upweighted in the GenePy statistic. Therefore, there may be utility of GenePy in haploinsufficient disease genes, but it is likely that a more stringent CADD cut off, such as ≥ 20, or limiting the GenePy statistic to the highest scoring variant is necessary to apportion lower GenePy scores to individuals who would otherwise accrue high scores from multiple rare, but benign variants with lower CADD scores.
GenePy also has potential to identify novel disease genes. If multiple top-ranking individuals across the same novel gene share similar clinical features, this may support the discovery of new disease genes. For novel haploinsufficient genes, unpublished data from our research group suggest that GenePy performs best when limited to high CADD scores e.g. CADD >20, whereas recessive genes may benefit from a more permissive CADD cut off.
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 30, 2023. ; https://doi.org/10.1101/2023.03.21.23287545 doi: medRxiv preprint

Conclusion
The application of GenePy to ~78,000 individuals in the 100,000 Genomes Project has identified 122 putative missed biallelic diagnoses in known autosomal recessive disease genes that are being returned to participants through the Genomics England diagnostic discovery pathway.
Selecting the top 5 ranked individuals for 2864 autosomal recessive genes yielded review of only 1.2 additional variants per individual, rendering GenePy a useful tool to identify biallelic variants of interest without significantly burdening diagnostic laboratories with additional variants to assess. A dilemma for many diagnostic laboratories is how to limit number of variants requiring assessment without missing diagnoses. Whilst strategies to prioritise dominant diseases are well established e.g. de novo analysis or Exomiser 16 , there are limited tools for prioritising recessive conditions. We attest that GenePy is a useful panel-agnostic adjunct to exome and genome analysis pipelines to uplift diagnoses of recessive disease.