Investigation of the STR loci noise distributions of PowerSeq™ Auto System

Aim To characterize the noise and stutter distribution of 23 short tandem repeats (STRs) included in the PowerSeq™ Auto System.
Methods Raw FASTQ files were analyzed using STRait Razor v2s to display alleles and coverage. The sequence noise was divided into several categories: noise at allele position, noise at -1 repeat position, and artifact. The average relative percentages of locus coverage for each noise, stutter, and allele were calculated from the samples used for this locus noise analysis.
Results Stutter products could be routinely observed at the -2 repeat position, -1 repeat position, and +1 repeat position of alleles. Sequence noise at the allele position ranged from 10.22% to 28.81% of the total locus coverage. At the allele position, individual noise reads were relatively low.
Conclusion The data indicate that noise generally will be low. In addition, the PowerSeq™ Auto System could capture nine flanking region single nucleotide polymorphisms (SNPs) that would not be observed by other current kits for massively parallel sequencing (MPS) of STRs.

Capillary electrophoresis (CE)-based technology has been the primary methodology to analyze short tandem repeats (STRs) in forensic DNA human identification testing for the past two decades. With the development of massively parallel sequencing (MPS), another viable platform is available for typing STRs. Several studies already have revealed the potential value of MPS for STR typing (1-16). MPS technology enables characterization of a locus based on sequence instead of length-based differences as in CE, so it can increase the discrimination power of some STRs (2,9). MPS is able to detect repeat motif variation (RMV) within STRs, as well as single nucleotide polymorphisms (SNPs) residing within repeats and within the flanking regions.
However, use of MPS technology also brings different challenges. Sequence data likely will assist in resolving mixture evidence better than CE-based data. There are two types of noise that must be addressed in order to develop meaningful guidelines for mixture interpretation, especially for trace level contributors. First, stutter, ie, slippage events during PCR, is well-defined (17-19) and is inherent in STR typing. Typically, -1 and less often +1 stutter are observed in CE-generated data. Other stutter artifacts (eg, -2, -3, and so on) are not observed because the signal for these less frequently occurring artifacts is buried within the noise. With MPS, multiple stutter products can be observed, especially when read depth (or coverage) is exceedingly high. Second, there is sequence noise due to a low-level sequence substitution error (SSE) and/or insertion/deletion error (IDE) rate. Such artifacts may exist with CE-based data as well but again often are not observed because they cannot be distinguished from background noise. However, MPS allows for detection of each molecule (or in actuality each molecular clone). These artifact features of STR typing with MPS must be described and defined per locus (20,21) to establish minimum thresholds and/or probabilities of events for an effective mixture interpretation protocol for MPS data.
Using data from Zeng et al (14), the noise and stutter distributions were characterized for 23 STRs included in the PowerSeq™ Auto System (CSF1PO, D10S1248, D12S391, D13S317, D16S539, D18S51, D19S433, D1S1656, D21S11, D22S1045, D2S1338, D2S441, D3S1358, D5S818, D7S820, D8S1179, DYS391, FGA, Penta D, Penta E, TH01, TPOX, and vWA) (Promega Corporation, Madison, WI, USA). While the sample size is small (as this is a preliminary study to describe the main artifacts), the results show that multiple stutter species can be observed. Moreover, sequence noise, which can be a reasonably large component of the total reads, is composed of many different species, such that the actual maximum sequence noise threshold per species for most STRs is very low. In addition to the artifact study, flanking SNPs were identified that would not be detected in other MPS commercial kits due to primer placement, emphasizing the point that population data for some STR haplotype variants will be kit-specific.

METHODS AND MATERIAL
The samples, extraction, PCR amplification, library preparation, and sequencing are described in Zeng et al (14).
Raw FASTQ files were exported from the MiSeq instrument and analyzed using STRait Razor v2s (22) to display alleles and coverage. For each sample, only the homozygous loci and the heterozygous loci with alleles differing by at least four repeats were used in the noise data analysis, so that extended stutter products could be observed unequivocally. The sequence noise was divided into several categories as follows: noise at allele position, noise at -1 repeat position, and artifact. The artifacts include noise at positions < -2 repeat and > +1 repeat, and incomplete variants (ie, IDE) between the -2 repeat and +1 repeat positions. For simplicity of presentation, at the -2 repeat and +1 repeat positions, stutter and sequence noise of the same nominal length as the stutter were combined and reported as stutter. For heterozygotes, the noise, stutter, and allele reads of the two alleles were combined and treated as those of a homozygote. For each sample at each STR locus, the reads of sequence noise, stutter, and allele were divided by the locus coverage to obtain the relative percentages of locus coverage. Finally, the average relative percentages of locus coverage for each noise, stutter, and allele category were calculated from the samples used for this locus noise analysis.
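The selection and normalization steps described above can be sketched in a few lines of Python. This is an illustration only, not the analysis code used in the study: the function names, the category labels, and the read counts are hypothetical, and the sketch assumes that the categories together account for all reads at the locus, so that locus coverage equals their sum.

```python
from statistics import mean

MIN_REPEAT_GAP = 4  # heterozygous alleles must differ by at least four repeats


def locus_is_usable(genotype):
    """Keep homozygotes and heterozygotes whose alleles are >= 4 repeats apart."""
    a, b = genotype
    return a == b or abs(a - b) >= MIN_REPEAT_GAP


def relative_percentages(read_counts):
    """Express each category's reads as a percentage of locus coverage.

    Assumption: the categories partition the locus reads, so coverage is
    taken as the sum of all category counts.
    """
    coverage = sum(read_counts.values())
    if coverage == 0:
        raise ValueError("locus has no reads")
    return {cat: 100.0 * n / coverage for cat, n in read_counts.items()}


def average_percentages(samples):
    """Average the per-sample percentages over the usable samples at one locus."""
    usable = [
        relative_percentages(s["reads"])
        for s in samples
        if locus_is_usable(s["genotype"])
    ]
    if not usable:
        raise ValueError("no usable samples at this locus")
    return {cat: mean(p[cat] for p in usable) for cat in usable[0]}


# Hypothetical read counts for three samples at one locus (not study data).
samples = [
    {"genotype": (10, 10), "reads": {"allele": 800, "allele_noise": 100,
     "stutter_minus1": 60, "noise_minus1": 10, "stutter_minus2": 5,
     "stutter_plus1": 5, "artifact": 20}},
    {"genotype": (11, 12), "reads": {"allele": 900, "allele_noise": 50,
     "stutter_minus1": 30, "noise_minus1": 5, "stutter_minus2": 5,
     "stutter_plus1": 5, "artifact": 5}},  # excluded: alleles 1 repeat apart
    {"genotype": (9, 14), "reads": {"allele": 700, "allele_noise": 200,
     "stutter_minus1": 50, "noise_minus1": 20, "stutter_minus2": 10,
     "stutter_plus1": 10, "artifact": 10}},
]

for category, pct in average_percentages(samples).items():
    print(f"{category}: {pct:.2f}%")
```

In this sketch the reads of both alleles of a heterozygote are assumed to have been pooled into the same categories beforehand, which mirrors the treatment of heterozygotes as homozygotes described above.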

Flanking region and repeat region SNPs
Maximum haplotype coordinates for the 21 overlapping loci were identified among three MPS systems, ie, Promega PowerSeq™ Auto System, ForenSeq™ DNA Signature Prep Kit (Illumina, San Diego, CA, USA), and Precision ID GlobalFiler™ Mixture ID panel (Thermo Fisher Scientific, South San Francisco, CA, USA), as well as for Penta D and Penta E, for a total of 23 STR loci. BED files were converted to HG19 and variants were identified within these regions using UCSC's Table Browser (23). Variant positions were further reduced by removing STRs, as well as SNPs and insertion/deletions (InDels) with allele frequencies below 0.05 in all super populations of the 1000 Genomes Project (24,25).
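The frequency filter can be illustrated as follows. This is a minimal sketch under stated assumptions: the record layout (`id`, `type`, `freq`), the variant identifiers, and the frequencies are all hypothetical placeholders, and the frequency is assumed to be that of the alternate allele. A variant is kept when it is a SNP or InDel whose frequency reaches 0.05 in at least one super population, which is equivalent to removing those below 0.05 in all of them.

```python
SUPER_POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "SAS")
MIN_FREQUENCY = 0.05


def is_common(freqs, threshold=MIN_FREQUENCY):
    """True if the allele frequency reaches the threshold in any super population."""
    return any(freqs.get(pop, 0.0) >= threshold for pop in SUPER_POPULATIONS)


def filter_variants(variants):
    """Drop STR-type records and SNPs/InDels that are rare in every super population."""
    return [
        v for v in variants
        if v["type"] in ("SNP", "InDel") and is_common(v["freq"])
    ]


# Hypothetical records; identifiers and frequencies are placeholders.
variants = [
    {"id": "var1", "type": "SNP",
     "freq": {"AFR": 0.20, "AMR": 0.01, "EAS": 0.00, "EUR": 0.02, "SAS": 0.01}},
    {"id": "var2", "type": "SNP",
     "freq": {"AFR": 0.01, "AMR": 0.02, "EAS": 0.00, "EUR": 0.04, "SAS": 0.03}},
    {"id": "var3", "type": "STR",
     "freq": {"AFR": 0.30, "AMR": 0.30, "EAS": 0.30, "EUR": 0.30, "SAS": 0.30}},
    {"id": "var4", "type": "InDel",
     "freq": {"AFR": 0.00, "AMR": 0.00, "EAS": 0.05, "EUR": 0.00, "SAS": 0.00}},
]

kept = [v["id"] for v in filter_variants(variants)]
print(kept)
```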

RESULTS AND DISCUSSION
In this study, the noise distributions of 23 STRs of the PowerSeq™ Auto System were investigated. The STR locus genotype, locus coverage, and the number of samples used for each STR noise analysis are shown in Table 1. The number of samples used in the data analysis ranged between 2 and 11. For example, for the CSF1PO locus, seven samples could be used in the stutter and noise analysis; all were homozygotes.
The noise distributions of 23 STRs are shown in Table 2.
The D22S1045 and Penta D loci had relatively low noise percentages (10.22% and 10.60%, respectively) at the allele position. However, the D22S1045 locus had high levels of stutter at the -1 repeat (11.15%) and +1 repeat (5.56%) positions (Figure 1). For the Penta D locus, stutter and noise were 1.60% and 0.20% of the total locus coverage at the -1 repeat position, respectively (Figure 2). For -2 repeat and +1 repeat stutter positions, the percentages of reads were 0.10% and 0.26%, respectively, of the total locus coverage. However, the Penta D locus had the second highest level of artifacts (7.17%, lower only than the D7S820 locus), such as IDEs. In contrast, the D2S1338 locus had the highest percentage of noise (28.81%) of the multiplex at the allele position (Figure 3). At the -1 repeat stutter position, stutter and noise were 8.34% and 3.80%, respectively, of the total locus coverage. For -2 repeat and +1 repeat stutter positions, the percentages of reads were 1.06% and 0.12%, respectively.
The distribution of the sequence noise at the allele position of the D2S1338 locus is shown for one sample (No. 025; genotype 17, 23) to show the range of sequence noise variation and the magnitude of any single species associated with an allele (Figure 4). Other than reads that were attributed to the true allele (ie, the same sequence with the most abundant reads), there were 117 and 137 sequence noise species that were the same length as alleles 17 and 23, respectively. Although the combined noise species accounted for 12% and 15% of the total reads at this locus for sample 025, the most abundant individual noise species had only 11X and 13X coverage for allele 17 (418X) and allele 23 (414X), respectively. The majority of noise species had only 1X coverage. Most of the noise likely is due to SSE and is chemistry related. However, the overall low level of individual noise species indicates that most SSE rates are low and that thresholds may be set relatively low (ie, well below the total noise observed in this study of 28.81% for the D2S1338 locus). Alternatively, since the noise may be characteristic of an allele at a locus, it may be possible to use the species distribution to resolve contributors of a mixture, making sequence data even more robust for mixture interpretation.
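The contrast between total noise and the largest single noise species can be made concrete with a short calculation. The sketch below is illustrative only: `summarize_noise` and the toy sequences are hypothetical, and the true allele is assumed to be the most abundant sequence of a given length, as in the definition above. The last two lines apply the same ratio to the coverages reported for sample 025.

```python
from collections import Counter


def summarize_noise(seq_counts):
    """Summarize same-length sequences at one allele position.

    Assumption: the most abundant sequence is the true allele and every
    other sequence of the same length is a sequence noise species.
    """
    ranked = Counter(seq_counts).most_common()
    _, allele_reads = ranked[0]
    noise_reads = [n for _, n in ranked[1:]]
    top = max(noise_reads, default=0)
    return {
        "allele_reads": allele_reads,
        "noise_species": len(noise_reads),
        "total_noise_reads": sum(noise_reads),
        "max_species_reads": top,
        "max_species_pct_of_allele": 100.0 * top / allele_reads,
    }


# Toy counts (not study data): one allele sequence and three noise species.
toy = {"TGCCTGCC": 100, "TGCCTGCT": 3, "TGCATGCC": 1, "AGCCTGCC": 1}
print(summarize_noise(toy))

# Ratios for the coverages reported for sample 025 at D2S1338.
print(round(100 * 11 / 418, 2))  # allele 17: 2.63
print(round(100 * 13 / 414, 2))  # allele 23: 3.14
```

On these figures, the largest single noise species is roughly 3% of its parent allele's reads, far below the combined noise at the allele position.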
Figure 4. A: allele 17. B: allele 23.
A total of 150 SNPs were identified in the maximum haplotype region of the 23 STR loci. The SNP frequencies in five populations were downloaded from the 1000 Genomes Project (24,25). SNPs with frequency ≥0.05 in at least one of five major populations (African, Ad Mixed American, East Asian, European, and South Asian) were selected. Thirty-four SNPs (19 flanking region and 15 repeat region SNPs) were identified within the STR haplotypes included in the PowerSeq™ Auto System (Table 3). Compared with the ForenSeq™ DNA Signature Prep Kit and the Precision ID GlobalFiler™ Mixture ID panel (in-house data, not shown), there were nine (seven unlinked) flanking region SNPs that were identified only in the PowerSeq STR amplicons. While detection of SNPs within repeats of alleles and corresponding stutter products may be affected by slippage, flanking region SNPs are far more stable and may be extremely useful for resolving stutter from minor or trace contributor alleles of the same nominal length in a mixture. Figure 5 demonstrates the possible application of flanking region SNPs. The genotype of sample No. 025 is 11, 12 at the D7S820 locus. The repeat motif of allele 11 is (TATC)11 and the haplotype for allele 12 is (TATC)12 with a flanking region SNP (rs7789995 T→A). Based on this flanking region SNP, the stutters from allele 12 are readily identified at the positions of allele 11 and stutter 10. In addition, the stutters caused by allele 11 also are detected at allele position 12 and stutter position 10.
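The D7S820 example suggests a simple rule for assigning a read to its parent allele: use the flanking SNP base to pick the parent, then take the difference in repeat number as the stutter offset. The sketch below is a hypothetical illustration of that idea, not a method from the study. It assumes that allele 11 carries the T base and allele 12 the A base at rs7789995, that each read has already been reduced to a repeat count and a flanking base, and that sequencing errors at the SNP position are ignored.

```python
def classify_read(repeats, flank_base, parents):
    """Assign a read to a parent allele via its flanking SNP base.

    `parents` maps the flanking base to the repeat number of the allele
    that carries it (an assumption that only holds when the two alleles
    differ at the flanking SNP).
    """
    parent = parents.get(flank_base)
    if parent is None:
        return "unassigned"
    offset = repeats - parent
    if offset == 0:
        return f"allele {parent}"
    return f"{offset:+d} stutter of allele {parent}"


# Sample No. 025 at D7S820: allele 11 assumed to carry T, allele 12 carries A.
parents = {"T": 11, "A": 12}

for repeats, base in [(11, "T"), (12, "A"), (11, "A"), (12, "T"), (10, "T"), (10, "A")]:
    print(repeats, base, "->", classify_read(repeats, base, parents))
```

Under these assumptions, reads of 11 repeats carrying A are assigned as -1 stutter of allele 12, and reads of 12 repeats carrying T as +1 stutter of allele 11, even though each has the same nominal length as the other true allele.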

Conclusion
In this study, the stutter, sequence noise distribution, and potential detection of additional flanking region SNPs of the 23 STRs included in the PowerSeq™ Auto System were investigated. Stutter products could be observed readily at 2 repeats less and 1 repeat greater than the true allele. Total sequence noise at the allele position ranged from a low of 10.22% to a high of 28.81% of the total locus coverage. However, individual noise species were relatively low, indicating that for most STRs noise likely will not have a substantial negative impact on mixture interpretation. Because of primer positioning, the PowerSeq™ Auto System could capture nine (seven unlinked) flanking region SNPs that would not be observed by either the ForenSeq™ DNA Signature Prep Kit or the Precision ID GlobalFiler™ Mixture ID panel. Thus, some STR haplotype allele variation will be multiplex specific.
Funding None.
Ethical approval received from the University of North Texas Health Science Center's Institutional Review Board.
Declaration of authorship XZ conducted the experiments, analyzed the data, and wrote the manuscript. JLK performed data analysis and edited the manuscript. BB developed concepts and designed experiments, edited the manuscript, and reviewed data.
Competing interests All authors have completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organization for the submitted work; no financial relationships with any organizations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.