Assessing HLA imputation accuracy in a West African population

The Human Leukocyte Antigen (HLA) region plays an important role in autoimmune and infectious diseases. HLA is a highly polymorphic region and thus difficult to impute. We therefore sought to evaluate HLA imputation accuracy specifically in a West African population, since West African populations are understudied and are known to harbor high genetic diversity. The study sets were selected from Gambian individuals within the Gambian Genome Variation Project (GGVP) Whole Genome Sequence datasets. Two different arrays, Illumina Omni 2.5 and Human Hereditary and Health in Africa (H3Africa), were assessed for the appropriateness of their markers, and these were used to test several imputation panels and tools. The reference panels were chosen from the 1000 Genomes dataset (1kg-All), 1000 Genomes African dataset (1kg-Afr), 1000 Genomes Gambian dataset (1kg-Gwd), H3Africa dataset and the HLA Multi-ethnic dataset. HLA-A, HLA-B and HLA-C alleles were imputed using HIBAG, SNP2HLA, CookHLA and Minimac4, and concordance rate was used as the assessment metric. Overall, the best performing tool was HIBAG, with a concordance rate of 0.84, while the best performing reference panel was the H3Africa panel, with a concordance rate of 0.62. Minimac4 (0.75) was shown to increase HLA-B allele imputation accuracy compared to HIBAG (0.71), SNP2HLA (0.51) and CookHLA (0.17). The H3Africa and Illumina Omni 2.5 array performances were comparable, suggesting that the choice of genotyping array has little influence on HLA imputation in West African populations. The findings show that using a larger population-specific reference panel and the HIBAG tool improves the accuracy of HLA imputation in West African populations.

HLA-C imputation was found to be most accurate, followed closely by HLA-A and lastly HLA-B (Table 2).

bioRxiv preprint doi: https://doi.org/10.1101/2023.01.23.525129; this version posted January 23, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

… (0.89) and CookHLA (0.21) tools were used. For SNP2HLA (0.66), the 1kg-All was the best …

As HIBAG was the best performing imputation tool, HLA alleles imputed by HIBAG were used for allele frequency and accuracy rate comparison, and the output is plotted in Fig 2. HLA imputation accuracy dropped when the frequency of HLA alleles increased across all the reference panels, especially for the HLA-B alleles. This is comparable to a study by (19) Multiethnic …

Overall, HLA-B alleles had higher error rates, showing they were imputed less accurately.
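The comparison of accuracy against allele frequency described above can be illustrated with a small sketch. It is not the analysis code used in the study: the function name `accuracy_by_frequency`, the frequency bins and the per-allele counts are all hypothetical, and accuracy is assumed here to be the fraction of correctly imputed copies of an allele within each frequency bin.

```python
from collections import defaultdict


def accuracy_by_frequency(allele_stats, bin_edges):
    """Pool imputation accuracy over half-open allele-frequency bins.

    allele_stats: iterable of (frequency, n_correct, n_total), one per allele.
    bin_edges: increasing bin boundaries; each bin is [low, high).
    Returns {(low, high): accuracy} for every bin that received data.
    """
    totals = defaultdict(lambda: [0, 0])  # bin -> [correct copies, total copies]
    for freq, n_correct, n_total in allele_stats:
        for low, high in zip(bin_edges[:-1], bin_edges[1:]):
            if low <= freq < high:
                totals[(low, high)][0] += n_correct
                totals[(low, high)][1] += n_total
                break
    return {b: c / t for b, (c, t) in totals.items() if t > 0}


# Hypothetical per-allele counts (frequency, correct copies, total copies).
stats = [(0.01, 3, 4), (0.03, 5, 6), (0.08, 9, 12), (0.20, 30, 40)]
by_bin = accuracy_by_frequency(stats, [0.0, 0.05, 0.10, 1.01])
```

Pooling counts within a bin, rather than averaging per-allele accuracies, keeps rare alleles with very few observations from dominating a bin; this is a design choice of the sketch, not something stated in the text.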

CookHLA imputed HLA alleles with the highest error rates (Fig 3a). HLA-B seemed to have higher error rates for SNP2HLA and HIBAG, while HLA-A alleles had higher error rates for CookHLA and Minimac4.
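For readers who want to reproduce per-locus error rates of this kind, the following is a minimal sketch. It assumes a common definition of concordance, namely the fraction of alleles that match between sequence-derived and imputed genotypes when the two alleles of each individual are compared without regard to order, and it treats the error rate as one minus concordance. The exact formula used in the study is not restated here, and the sample IDs and allele calls below are hypothetical.

```python
from collections import Counter


def allele_matches(truth, imputed):
    """Number of alleles (0, 1 or 2) shared by two unordered genotype pairs."""
    overlap = Counter(truth) & Counter(imputed)  # multiset intersection
    return sum(overlap.values())


def concordance_rate(truth_calls, imputed_calls):
    """Fraction of correctly imputed alleles over samples present in both inputs."""
    shared = sorted(set(truth_calls) & set(imputed_calls))
    if not shared:
        raise ValueError("no samples in common")
    matched = sum(allele_matches(truth_calls[s], imputed_calls[s]) for s in shared)
    return matched / (2 * len(shared))  # two alleles per individual


# Hypothetical HLA-B calls for three samples.
truth = {
    "S1": ("B*53:01", "B*35:01"),
    "S2": ("B*15:03", "B*15:03"),
    "S3": ("B*07:02", "B*58:01"),
}
imputed = {
    "S1": ("B*35:01", "B*53:01"),  # same alleles, different order
    "S2": ("B*15:03", "B*15:10"),  # one allele wrong
    "S3": ("B*07:02", "B*58:01"),
}
rate = concordance_rate(truth, imputed)
error_rate = 1 - rate
```

Using a multiset intersection means a homozygous truth genotype is credited only once when the imputed genotype is heterozygous, which is why sample S2 contributes one match rather than two.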

An interesting observation was that Minimac4, a general imputation tool, imputed HLA-B alleles more accurately than any of the HLA-specific imputation tools.

… showed that HLA-B had a higher error rate, followed by HLA-A and lastly, HLA-C. For SNP2HLA (Fig 3c), HLA-B imputation was less accurate, followed by HLA-C and finally HLA-A. HLA-A had higher error rates for Minimac4 (Fig 3d) and CookHLA (Fig 3a), …

… showed improved imputation accuracy of HLA-B alleles, suggesting that a general imputation tool can be used for studies targeting HLA-B alleles.

Another important factor is the choice of genotyping array. The Illumina Omni 2.5 array performance was slightly better than that of the H3Africa array because it has more SNPs in the target population, 13,850 SNPs compared to 13,436 SNPs. This difference was, however, not statistically significant, suggesting that the choice of genotyping array has little influence on HLA imputation accuracy. Note, however, that the two arrays overlap substantially in their content, which may explain the similar performance. A comparison of more diverse arrays is therefore necessary to fully assess the impact of array content.
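The degree of content overlap between two arrays can be quantified with simple set operations. The sketch below is illustrative only: the function name `array_overlap` and the SNP identifiers are hypothetical, and the Jaccard index is used as one possible overlap measure, not one reported in the study.

```python
def array_overlap(snps_a, snps_b):
    """Summarise shared and array-specific markers between two SNP lists."""
    a, b = set(snps_a), set(snps_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared markers as a fraction of all distinct markers.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical marker IDs standing in for SNPs in the HLA region on each array.
omni_snps = ["rs1", "rs2", "rs3", "rs4", "rs5"]
h3africa_snps = ["rs3", "rs4", "rs5", "rs6"]
overlap = array_overlap(omni_snps, h3africa_snps)
```

A Jaccard index close to 1 would indicate that the two arrays genotype nearly the same markers in the region, in which case similar imputation performance would be expected regardless of the array chosen.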

In (24), it was shown that genome-wide coverage of genotyping arrays correlates with the number of SNPs on the genotyping arrays but does not correlate with the imputation quality.

Generally, the size of the reference panel has a substantial impact on HLA allele imputation accuracy (25). As expected, increased accuracy was achieved with a more extensive reference panel. To assess the accuracy of imputing HLA alleles using a larger reference panel, we performed imputation with the large HLA Multi-ethnic reference panel via the Michigan Imputation Server. Imputation accuracy was slightly higher than with the other reference panels, but we could not compare it with the other tools as the server only provides the Minimac4 tool.

Other than the reference panel sample size, population specificity also affected imputation accuracy.

Overall, the H3Africa reference panel outperformed the other reference panels due to its larger sample size and relatedness to the target population. It outperformed the other panels when imputing using HIBAG and CookHLA, while the 1kg-All reference performed better when imputing using SNP2HLA. This suggests that HIBAG's and CookHLA's performance depended on both population specificity and sample size, while SNP2HLA's performance depended on sample size alone.


The study was supported by the National Institutes of Health [grant number U24HG006941].

The content is solely the authors' responsibility and does not necessarily represent the official views of the National Institutes of Health.