Genotyping sequence-resolved copy-number variation using pangenomes reveals paralog-specific global diversity and expression divergence of duplicated genes

Human pangenomes contain assemblies of non-reference copy-number variable (CNV) genes. We developed a new method, ctyper, to identify the copy-number of specific alleles of CNV genes cataloged in pangenomes with NGS datasets. Applying ctyper to the 1000-genomes samples revealed population stratification of paralogs and two classes of CNVs: recent CNVs due to ongoing duplications, and polymorphic CNVs from non-reference ancient paralogs. Expression quantitative trait locus analysis determined allele-specific expression within gene families, revealing that 7.94% of paralogs and 3.28% orthologs had significantly divergent expression. Case studies of individual genes include finding lower expression on SMN-1 copies that arose from conversion from SMN-2, and increased expression on a form of AMY2B that has undergone a translocation. Moreover, 4.7% of paralogs and 1.2% of orthologs had different most-expressed tissues. Furthermore, the genotypes explain more expression variance than known eQTL variants. Overall, ctyper enables biobank-scale genotyping of sequence-resolved CNVs.


Identifying alignment breakpoints on query sequences to construct pan-gene alignment graph
We then align each query sequence to the reference set to identify structural variant breakpoints.Sequences are mapped to the reference set, retaining alignments over 300 bases and 95% identity.The starting and ending points of alignments are recorded on each reference set sequence, and breakpoints are defined as clusters of alignment endpoints within 50 bps.The vertices of the graph are then defined by the sequences' endpoints and an enumeration of breakpoints.Edges are established based on the reference set segments between corresponding endpoints or breakpoints of the vertices.Consequently, each query sequence correspond to a path in the graph, which can then be used as a template to align the query sequence with high identity and no structural variation.We refer to this as a sequence path.
We construct a directed graph using all sequence paths, along with a start node (S) that connects the start points of all sequence paths, and an end node (E) that connects the end points of all sequence paths.This graph is referred to as the pan-gene alignment graph.

Linearization of the pan-gene alignment graph
First, in order to represent the repetitive sequences in the graph, if a sequence path contains multiple visits of a same node (cycle), we will create new nodes for the redundant visits.For example, the second occurrence of the breakpoint B1 on any query sequence will be renamed to B1 2 .
We then record all breakpoints and aim to sort them into a list that has the best overall agreement with the orders in all sequence paths.1. Starting from the start node S, we propagate to its most frequent unvisited successor across all sequence paths.
2. We keep propagating and record every new node we visit to the list at the location right after the last visited node, which is important when we re-visit the same node again.If all successors of a node have already been visited, the propagation moves to the end node E, if connected, or returns to the last visited node.
3. When we reach the end node, we re-propagate it to start node and repeat the cycle until all node has been visited.
The resulting list determines the order of linearization.Using this ordered list of nodes, we identify the corresponding edges (if any) and arrange the associated segments in the determined order.We then assemble these segments to create a template, referred to as the global linear template.

Pairwise global alignments to determine and visualize small variants
To enhance alignment quality, we re-align each query sequence to the pan-gene linear template.We extract the segments from each query sequence that mapped to the template and assembled them into what we refer to as a local linear template.Global pairwise alignments are then performed between each sequence and its corresponding local linear template.These alignments are subsequently transferred back to the pan-gene linear template.For segments of sequences that are not included to the local linear template, gaps are introduced.
We then assemble all pairwise alignments onto the pan-gene linear template, ensuring that sequence lengths were uniform at locations with insertions or deletions (indels).The result is a graphic multiple sequence alignment (MSA).
Variants within each sequence were readily identified and annotated based on the alignments.These variants can be visualized as a mutant plot, providing a clear and informative overview of genetic diversity, including both structural variations and small variants, across different sequences.

k -mer counting, normalization and bias correction in NGS data
During genotyping of NGS samples, we count all k -mers in each gene group's k -mer list.Specifically, we encode 31-mers as 64-bit integers for hashing purposes.We selected approximately one million random unique k -mers from exons of protein-coding genes to establish background sequencing coverage, which we refer as background k -mers.The average copy numbers of those background k -mers is determined as the k -mers coverage.We normalize all copy numbers of k -mers by this coverage to obtained the initial k -mers vectors of NGS samples.We found that certain genomic regions, such as the HLA-DRB gene-previously identified as the most problematic genes in other study 3 -exhibited strong batch effects.The bias overall increases the coverage of NGS data in this region and might potentially cause false positive calling of duplications and false negative of detecting deletions.
To address this, we provide a function for bias correction.We benchmarked individual k -mer in NGS data against long-read assemblies of the same sample, which served as the golden standard for true copy number determination.To identify k -mer with the lowest confidence, we calculated a posterior p-value for each k -mer' occurrence based on the Poisson distribution of ground-truth copy numbers in each NGS datasets.The k -mer with more than 15% of samples (6 out of 40) exhibiting a p-value below 0.05 were flagged.We assessed the average ratio discrepancies between ground-truth and observed copy numbers across 40 NGS samples, applying ratio corrections to the k -mer with low confidence.
The bias correction we performed largely reduced the Hardy Weinberg disequilibrium of HLA-DRB genotyping, from 34.3% (48 out of 140) to no Hardy-Weinberg disequilibrium.
After background normalization and bias correction, a k -mer vector V is obtained, denoting the most likely integer copy numbers of all k -mer in the original genome.
However, the underlying cause of this bias remains to be investigated, and we observed this bias in both cell-line and non-cell line NGS samples.

Detection of novel partial duplication and deletion below the scale of pangenome-alleles
To detect novel partial gene duplications and deletions that have not been previously observed in the pangenome cohorts and are of much smaller scale than pangenome alleles, we designed an auxiliary function in ctyper that utilizes k -mer coverages.
Recall that all pangenome alleles (PAs) are included in the graphic MSAs by the minimum sufficient sets of segments.We annotated the k -mer that locate at exclusive positions (i.e., k -mer found only in one segment and at a single position on that segment) and record these in our database.
After completing genotyping, we calculate the coverage ratio for each k -mer using the genotyping solution.The coverage ratio is defined as the ratio between the expected k -mer count from genotyping and the observed k -mer count in NGS data.We average the coverage ratios within each window on each segment (default window size = 60).This coverage can be further averaged over larger scales; we recommend using a scale of at least 3K.
These coverage patterns can also be visualized as a dot plot, which allows us to identify and visualize fragmented gene duplication and deletion events.
However, due to the rarity and difficulty of detecting novel events, benchmarking this results is challenging.It is also often difficult to distinguish real events from misassemblies, making comprehensive benchmarking infeasible.Nonetheless, it is estimated to reduce discrepancies the genotyping results from de novo assembly (CN1) by 7% for events larger than 3K.

Two step alignments for better lift-over of CNV genes
Our phylogenetic-based annotation and classification relied on sequence similarities among different pangenome alleles and were independent of their specific genomic locations.However, understanding these locations was important for distinguishing between paralogs and orthologs, especially in cases involving highly mutated sequences.
While the Human Pangenome Reference Consortium (HPRC) and the Human Genome Structural Variation Consortium (HGSVC) had previously conducted their own lift-over processes and analyses, these efforts may not be sufficient for accurate ortholog identification.The limitation was largely due to the fact that copy number variable genes often reside in segmental duplication regions, which are susceptible to structural variations and repetitive sequences.Alignments on these genes often result in fragmented alignments and multiple high-confidence alignments, presenting significant challenges for conventional sequence aligners.
To overcome the challenge, we developed a highly sensitive, multi-step alignment pipeline that allowed for the alignment and lift-over of sequences through a series of fragmented local alignments.
For the lift-over, we utilized the CHM13 T2T and GRCh38 genomes, including all annotated alternative loci, as templates.We constructed a BLAST database encompassing all these references.
Each lifted sequence was extended by 10,000 base pairs (bps) upstream and downstream to serve as anchor sequences.These sequences, along with the anchors, were then aligned using the local alignment tool BLASTn 1 , selecting alignments with over 90% similarity and lengths exceeding 500 bps.
We first merged alignments located on the same reference chromosome within a 10,000 bp range, forming what we refer to as alignment clusters.For each cluster, we created a pseudo-global alignment.For each query region, we selected the local alignment with the highest local alignment score (giving priority to higher-ranked alignments in case of score ties) and concatenated these to form a pseudoglobal alignment.This approach provides a spliced global alignment between the query and the loci corresponding to each alignment cluster, optimizing the total alignment score.
To calculate a global alignment score for each locus within an alignment cluster, we assigned a score of +1 for each matched base and -4 for each mismatched base.Gaps in the query were penalized with a score of -3, plus an additional penalty equal to the gap length.The global score was the sum of these individual scores.Loci with a score greater than zero were considered valid.Queries without valid loci were considered unmapped, and for those with multiple valid loci, the one with the highest score was selected as the most likely location.
However, local tandem duplications and translocations on the query or mapped locus can obscure the precise location of the query.To address this challenge, we performed an additional pairwise global alignment using Stretcher 2 between the query and the mapped locus.This approach allowed us to more accurately pinpoint and annotate the query's location based on the results of the Stretcher alignment.

Structural variants (SV) calling based on global pairwise alignment
Annotation of SVs on PAs is important to understand the genetic architecture and global diversity.
Using pairwise global alignments, we called SVs on PAs relative the their corresponding reference genes (most similar reference genes if > 90%, else are gene overlap with the lift-over location).Three types of SVs are called, including: deletion, insertion and duplication.
Based on pairwise global alignment, we searched for the gaps larger than 20bps on both query and reference sequences.To avoid large gaps were scattered by small fragments within it, we also merged all gaps within less than 20 bps in distance.
The gaps on query sequences will be annotated as deletions and gaps on reference sequences will be annotated as insertions.If more than 50% of the inserted sequence can be aligned to elsewhere on the reference sequences itself, it will be re-annotated as duplication.
We only annotated SVs with more than 300bps unmasked bases.

Determining the optimal transcripts based on exon alignments
Aligning transcripts to each pangenome-allele is essential for annotating diverged paralogs that differ from reference genes in both genomic location and sequence.This annotation is also important for detecting local duplication, distinguishing full gene duplications from exon expansions, and identifying pseudogenes with incomplete transcripts.
We first align all reference exons within the gene group to the pangenome-allele, using only exons with more than 90% similarity to the reference exons.We then assemble the aligned exons into transcripts.
We consider a transcript to be validly assembled if: 1.It contains no redundant exons, so any duplicated exons are treated as new units.
2. It maintains the correct exon order, with no rearrangements.
3. Its overall alignment score equals the total alignment score of all its exons, with a penalty of -100.0 for each missing exon.
Using a greedy strategy, we independently find the highest possible alignment score for each involved transcript in each iteration.
To find the highest possible alignment score, we applied a dynamic programming scheme to determine the global optimal choice for each exon: 1. We let E i,j denote the decision where, at the i th exon on the whole contig, with the last exon on the transcript is its j th exon in the reference.
2. We can conclude this relationship between solutions: mean distance from v to mean of group B, σ B := G B − v The distance from A i to mean of group A, σ Ai := G Ai − σ A , The distance from B i to mean of group B, σ Bi := G Bi − σ B , Let σ B := σ A + σ B , where σ B := σ B − σ A , it is the distance from group B to group A. Let x A = Ai∈A x i , and Because σ A , σ A1 ....σAn , σ B1 ....σ Bn , σ B correspond to independent phylogeny processes, they may be considered as close to orthogonal vectors and ⊥ to each other.Thus: By the Cauchy-Schwarz inequality, Ai ) has the minimum value when: For a sufficient sample set without bias and all alleles follow phylogenetic relationships, we might consider that 1 ∥σ B ∥ will coverage to a positive constant too as the sample size increases as σ B converges to a positive constant.On the other hand, ( Ai∈A 1 ∥σ A 1 ∥ ) diverges to positive infinite as the distance to the closest neighbor min Ai∈A A1 ∥ converges to 0. Hence, x A will converge to 1, the correct integer solution.
This means the model will have a good precision to detect known alleles close to the unknown allele close and have a increasing power with the increase of the data size.

Figure 1 .
Figure 1.Comprehensive case study on HPR and HP genes.Left: the mutant plot for all PAs on the HPR and HP locus based on graphic MSA; Proximal duplication and a SNP associated with proximal duplications are annotated (rs35184823, reported by GETx).Middle: the genotyping results based on 2504 unrelated 1kpg samples.Each pie chart represents the copy number distribution of each allele-type in each population.Right: the expression level for each allele-type estimated from the genotyping results of GETx samples and published GETx whole blood TPM, using linearmixed-model.The allele-types found in less than 5 samples are excluded.Co-expression has been found between HPR and HP (t = 24.089,p-value = 6.2e-78, pearson correlation).