November

NASTRA: Innovative Short Tandem Repeat Analysis through Cluster-Based Structure-Aware Algorithm in Nanopore Sequencing Data

Zilin Ren

zilin.ren@outlook.com 2 3 4

Jiarong Zhang

0 1 3 4

Yixiang Zhang

2 3 4

Pingping Sun

2 3 4

Jiguo Xue

0 3 4

Jiangwei Yan

yanjw@sxmu.edu.cn 1 3 4

Ming Ni

niming@bmi.ac.cn 0 3 4

0. Institute of Health Service and Transfusion Medicine, Beijing 100850, People's Republic of China
1. School of Forensic Medicine, Shanxi Medical University, Taiyuan 030001, People's Republic of China
2. School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun 130122, China

2023

Short tandem repeats (STRs) are a type of genetic marker used to distinguish individuals and authenticate cell lines. Nanopore sequencing is promising for STR typing because of its convenience, but it lacks dedicated analysis methods. Here we propose NASTRA, a tool for accurate STR genotyping with nanopore sequencing, which uses an STR-structure-aware algorithm to infer the repeat numbers of STR motifs. In our real-time scenario testing, NASTRA achieved 100% accuracy for diploid STRs, far exceeding a method that aligns reads against all candidate STR allele sequences. NASTRA could be useful in applications such as individual identification and cell-line authentication with nanopore sequencing. NASTRA is available at https://github.com/renzilin/NASTRA.

Keywords: nanopore sequencing; short tandem repeat; individual identification; cell line authentication; bioinformatics tools.

Background

Short tandem repeats (STRs) in the human genome are specific DNA loci that can show a high degree of polymorphism in the number and form of repeat motifs among individuals. This feature makes STRs genetic markers well suited for individual identification [1,2], cell line authentication [3,4], and relationship inference [5–7]. In the Short Tandem Repeat DNA Internet Database maintained by the National Institute of Standards and Technology (NIST) [8], more than 70 STR genetic markers, typically with 19 (D10S1248) to 114 (FGA) alleles in human populations, are chosen for desirable features such as high heterozygosity, regular repeat units, distinguishable alleles, and robust amplification. These STRs have been used for decades by forensic laboratories across the globe and form the basis of human DNA databases holding millions or even tens of millions of profiles for public security purposes in many countries. In biomedical research, STR profiling for cell line authentication plays a critical role in ensuring scientific reproducibility and preventing misidentification and cross-contamination of cell lines [9,10]. To establish standardized practices for cell line authentication using STRs, the International Cell Line Authentication Committee (ICLAC, https://iclac.org) integrates numerous online databases and search tools, including the Cellosaurus knowledge resource [11], the DSMZ STR profile database [12], the CLIMA database [13], and CLASTR [14].

The conventional method for STR genotyping is capillary electrophoresis (CE)-based length analysis. In addition, massively parallel sequencing (MPS) has been validated for STR profiling, and it provides a larger battery of STRs per run than the CE-based method. However, most Sanger sequencing and MPS instruments are heavy (50-186 kg) and costly (50k-180k US dollars), and they have demanding environmental requirements such as constant temperature and a vibration-free setting. Thus, Sanger sequencing and MPS systems are usually maintained by well-equipped laboratories or facilities, and samples must be transported for analysis, so it might take days to obtain STR typing reports.

A more rapid and affordable solution for STR profiling would therefore benefit relevant research and applications. Nanopore sequencing provides a cheap, real-time sequencing approach. Nanopore sequencing devices, such as the MinION Mk1B/Mk1C (Oxford Nanopore Technologies, Oxford, UK) and the recently released QNome-3841 (Qitan Technology, Chengdu, China), are exceptionally portable and considerably cheaper than NGS and Sanger sequencing systems. Moreover, these portable devices can generate much longer reads than Sanger sequencing and NGS systems, with a throughput in the Gbp range in a single run. Nanopore sequencing has therefore been applied in various on-site sequencing applications, such as in-field genomic surveillance of pathogens [15,16] and biodiversity investigations [17,18]. Taking advantage of this portability and high throughput, Zaaijer et al. [19] reported an interesting pilot study that used the MinION for whole-genome sequencing to rapidly profile SNPs and authenticate human cell lines for biological research.

However, despite these advantages, nanopore sequencing data are noisier than those generated by Sanger sequencing and MPS. Unlike STR expansion disorders, in which pathogenic alleles differ remarkably in size from benign alleles [20–22], alleles of STR genetic markers used for individual or cell-line identification are usually 100-300 nt long, with repeat units of 3 to 5 nt. Genotypes of STR genetic markers are defined by the exact counts of their repeats. Many STR genetic markers consist of multiple repeat units (compound STRs and complex STRs), and the units are similar to one another within an STR and to its flanking regions. For instance, the repeat structure of the Combined DNA Index System (CODIS) core STR locus D3S1358 is TCTA[TCTG]n[TCTA]m, in which n and m denote repeat numbers. Moreover, the repeat units of some STRs contain 3- to 4-nt homopolymers or form homopolymers across repeats, as in STR Penta D with a repeat structure of [AAAGA]m; such homopolymers are particularly and systematically error-prone in nanopore sequencing [23].

Thus, developing a highly accurate typing method for genetic STR markers with nanopore sequencing is challenging [24–27]. To tackle this issue, developing algorithms and workflows specifically tailored to typing genetic STR markers, taking into account the repeat structures of specific STR loci, might be a feasible strategy. In our previous study [23], we showed that conventional genotyping tools for nanopore sequencing, such as repeatHMM [21], are not suitable, and that a simple customized workflow integrating the classic Smith-Waterman algorithm for repeat unit re-alignment significantly improved the accuracy of STR typing. More than 32 STRs could be consistently and correctly typed in a validation set involving 31 individuals, making it a proof-of-concept study of nanopore sequencing in human identity testing and cell-line authentication. Recently, Hall et al. [27], Tytgat et al. [24], and, very recently, Jidong et al. [28] used a different and straightforward strategy that includes all available allele sequences for the alignment of nanopore sequencing reads and selects the alleles supported by sequencing reads for STR genotyping. In their validation sets, they reported higher STR typing accuracies than our previous method, which further indicates the promise of nanopore sequencing in this field. However, this strategy depends heavily on the available sequences, many of which are unknown for these highly diverse STRs, and this might lead to typing errors. In addition, larger validation sample sets are needed to validate STR typing methods.

We aim to develop a method for accurate STR genotyping in applications such as individual identification and cell-line authentication. We expect the method to fully consider the characteristics of these STR markers and to be highly flexible, i.e., independent of allele sequence databases, which may lack allele information. In this study, we propose NASTRA, a tool that uses a clustering-based, structure-aware algorithm, rather than allele-dependent alignment, to infer repeat structure for STR genotyping. We include a validation set involving 76 individuals and 8 cell lines, tested with different chemistries (R9 and R10 flow cells) and different STR amplification methods prior to sequencing. We also ran the validation with an allele-dependent method, and NASTRA performed better in both accuracy and sequencing data requirements.

Results

In this study, we first provided an overview of NASTRA and then demonstrated how to find the optimal thresholds for the supporting reads number (SN) and the ratio of supporting reads numbers (SNR), which are pivotal for STR genotype calling. Following this, we conducted an extensive evaluation of NASTRA, comparing it with a typical alignment-based method named STRspy and investigating the characteristics of genotyping errors. We then conducted a real-world scenario test using eight human cell lines. Lastly, we summarized the runtime of NASTRA. Throughout these processes, we used the MiSeq FGx™ Forensic Genomics System to carry out STR genotype calling for ground truth construction. Specifically, we amplified 54 STR loci and 46 STR loci across 88 DNA samples using Verogen’s ForenSeq Mix A kit and Promega’s PowerSeq 46GY kit, and sequenced the amplicons with R9.4.1 and R10.3 flow cells (Figure 1a).

Overview of NASTRA

For STR genotyping, NASTRA requires BAM files, BED files containing the positional information of the STRs, and fact sheets describing the repeat structure of the STRs. In addition, NASTRA performs quality control and determines the homozygosity or heterozygosity of genotyped alleles based on the SN of each allele and the SNR between alleles. Drawing inspiration from the ForenSeq Universal Analysis Software (UAS), we incorporated a similar mechanism to filter loci with low coverage and stutter: if the total number of reads covering the STR locus is below 10, NASTRA does not output any result because of insufficient coverage; if the SN of the major allele does not exceed the threshold, the result is classified as "interpretation"; if the SN of the minor allele does not exceed the threshold but the SNR does, the result is classified as "imbalance". We provide a configuration file with the optimal SN and SNR values for users (the relevant analysis is described in the Parameter optimization section).

NASTRA comprises two main sub-algorithms, read clustering and repeat structure inference, which mitigate the potential impact of subtle sequencing errors on genotyping accuracy and genotype STRs without the need for an allele reference database (Figure 1b). In step 1, NASTRA retrieves aligned reads that span a designated STR locus from the BAM file based on positional information. In step 2, the prefix and suffix flanking sequences are individually aligned against the extracted reads using an affine-gap penalty (see Methods). The upstream and downstream sequences of all retrieved reads are then trimmed, retaining only the three adjacent bases flanking the core repeat region of the STR. Identical trimmed reads are collapsed into unique trimmed reads, and their frequencies are recorded as SN.

Because of subtle sequencing errors, the trimmed reads can comprise over 100 distinct sequences. Therefore, in step 3, NASTRA clusters these reads using pairwise alignment with an affine-gap penalty. By tolerating minor errors, such as single-base insertions, this approach decreases the diversity among reads and thereby facilitates the identification of candidate alleles. In step 4, a recursive algorithm infers the repeat structure of allele sequences based on the repeat units present within the STR, which yields STR genotypes quickly and helps locate SNVs within the locus. Finally, NASTRA conducts STR genotyping using the SN and SNR values associated with the alleles.
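The quality-control and zygosity rules described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than NASTRA's actual code: the function name call_locus, the status labels, the definition of SNR as the minor-allele SN divided by the major-allele SN, the rule that a minor allele below the SNR threshold is ignored, and the default SNR threshold of 0.3 are all assumptions made for the example. Only the minimum coverage of 10 reads and the SN value of 25 (reported for autosomal STRs in the Parameter optimization section) come from the text.

```python
from collections import Counter

MIN_COVERAGE = 10  # from the text: fewer than 10 reads at a locus gives no result


def call_locus(allele_reads, sn_threshold=25, snr_threshold=0.3):
    """Return (status, alleles) for one STR locus.

    allele_reads: one allele label per read, e.g. ["15", "15", "17", ...].
    snr_threshold: hypothetical value, chosen only for illustration.
    SNR is assumed here to be minor SN / major SN.
    """
    if len(allele_reads) < MIN_COVERAGE:
        return "no_call", []

    ranked = Counter(allele_reads).most_common()
    major, major_sn = ranked[0]
    if major_sn <= sn_threshold:
        # The major allele itself is not supported by enough reads.
        return "interpretation", [major]
    if len(ranked) == 1:
        return "homozygous", [major]

    minor, minor_sn = ranked[1]
    snr = minor_sn / major_sn
    if snr < snr_threshold:
        # Assumption: a weak second allele is treated as noise or stutter.
        return "homozygous", [major]
    if minor_sn <= sn_threshold:
        # SNR passes but the minor allele has too few supporting reads.
        return "imbalance", [major, minor]
    return "heterozygous", sorted([major, minor])
```

Under these assumptions, a locus with 60 reads of allele 15 and 50 reads of allele 17 is called heterozygous, while 40 reads of allele 15 and 20 reads of allele 17 is flagged as "imbalance" because the minor allele passes the ratio test but not the SN test.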

Parameter optimization

NASTRA performs STR genotype calling based on the SN and SNR values of the alleles. The SN and SNR thresholds therefore need to be tuned to make NASTRA robust. To this end, we partitioned the samples into two sets: (1) a training set, comprising 20 males and 20 females, and (2) a test set, consisting of 18 males, 18 females, and 4 replicates of the 2800M sample. However, two issues could prevent us from obtaining optimal threshold values: (1) the limited size of the training set; and (2) the high sequencing depth of the existing samples, which makes the results insensitive to the thresholds. To address these issues, we developed a mini-tool called NanoTime. Using sequencing summary files, NanoTime downsamples nanopore sequencing reads according to sequencing duration: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, and 24 hours. This process expanded both the training set and the test set from 40 samples to 480 samples each.
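The idea of downsampling by sequencing duration can be illustrated with a short sketch. It is not NanoTime's implementation, which is not described here. It assumes that the sequencing summary is a tab-separated file with a read_id column and a start_time column giving seconds since the start of the run; the function name read_ids_within and the toy data are hypothetical.

```python
import csv
import io


def read_ids_within(summary_handle, hours):
    """Return the IDs of reads that started within the first `hours` hours."""
    cutoff = hours * 3600.0  # start_time is assumed to be in seconds
    reader = csv.DictReader(summary_handle, delimiter="\t")
    return {row["read_id"] for row in reader if float(row["start_time"]) <= cutoff}


# Toy sequencing summary with four reads (made-up values).
toy_summary = (
    "read_id\tstart_time\n"
    "read_a\t120.5\n"
    "read_b\t3599.0\n"
    "read_c\t5400.0\n"
    "read_d\t30000.0\n"
)

first_hour = read_ids_within(io.StringIO(toy_summary), hours=1)
first_two_hours = read_ids_within(io.StringIO(toy_summary), hours=2)
```

The resulting ID sets could then be used to filter the reads of a sample, giving one simulated dataset per duration (12 durations for each of 40 samples gives the 480 expanded samples mentioned above).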

To assess performance, we used allele calls from UAS reports as the ground truth and excluded UAS results labeled as "imbalanced", "allele count", "stutter", and/or "interpretation threshold". NASTRA results that failed quality control were then excluded from the evaluation. Finally, the genotyping result for each STR obtained with NASTRA was assigned to one of five classes: (1) Exact Match: correct genotyping; (2) Incomplete Match: correct genotyping when disregarding the incomplete repeat unit; (3) One Match: one of the alleles is correctly genotyped in a heterozygous STR genotype; (4) Incomplete One Match: one of the alleles is correctly genotyped in a heterozygous STR genotype when disregarding the incomplete repeat unit; (5) Mismatch: incorrect genotyping.
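One possible reading of these five classes is sketched below. The sketch assumes that alleles are written in the usual STR notation, where "9.3" means nine complete repeats plus an incomplete unit of three bases, and that "disregarding the incomplete repeat unit" means comparing only the part before the decimal point. The function names are hypothetical, and the evaluation code used in this study may differ in its details.

```python
def _complete_repeats(allele):
    """Drop the incomplete-unit suffix, e.g. "9.3" -> "9"."""
    return allele.split(".")[0]


def classify(truth, call):
    """Classify a called genotype against the ground truth.

    truth, call: pairs of allele strings, e.g. ("9.3", "10").
    """
    if sorted(truth) == sorted(call):
        return "Exact Match"

    truth_whole = [_complete_repeats(a) for a in truth]
    call_whole = [_complete_repeats(a) for a in call]
    if sorted(truth_whole) == sorted(call_whole):
        return "Incomplete Match"
    if set(truth) & set(call):
        return "One Match"
    if set(truth_whole) & set(call_whole):
        return "Incomplete One Match"
    return "Mismatch"
```

For instance, a truth of ("9.3", "10") called as ("9", "10") falls into the Incomplete Match class, whereas a truth of ("12", "14") called as ("12", "15") is a One Match.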

Thresholds for Autosomal STRs. A total of 12,960 genotype calls for 27 autosomal STR loci on 480 samples were obtained. We first explored various SN values (0, 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50) and SNR values (ranging from 0 to 1 in increments of 0.01). For each SN value, we calculated the maximum genotyping accuracy for each autosomal STR locus on the expanded training set. The results showed that when SN is 25 or higher, the maximum genotyping accuracy of every STR locus can exceed 95% (Figure 2a). With SN fixed at 25, we calculated the genotyping accuracy for each autosomal STR at various SNR values. The median of the SNR range corresponding to the peak genotyping accuracy was taken as the SNR threshold (Figure 2b). After determining the SN and SNR thresholds, we used NASTRA for STR genotype calling on the expanded test set. In summary, 9,428 of 12,960 (72.8%) STR genotypes across the 480 expanded test samples were called successfully, consisting of 8,455 (89.7%) exact matches, 642 (6.8%) incomplete matches, 307 (3.3%) one matches, 11 (0.1%) incomplete one matches, and 0.1% mismatches. The remaining 3,532 genotype calls failed NASTRA quality control (2,548 calls) or UAS quality control (984 calls) because of the low coverage at short sequencing durations. Figure 3c illustrates the accuracy of each STR across the 480 samples. Among the 27 loci, 18 loci exhibited a calling rate of 1 (either exact match or incomplete match), while 4 loci exceeded 95% accuracy. Additionally, we used pairwise alignment to investigate the sequence identity of each genotyping result (Figure 2d). For the majority of "incomplete match" genotyping results, the sequence identity of the allele sequences equals 1. This suggests that NASTRA disregards incomplete repeat units only in repeat counting, while preserving the true allele sequences.
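The rule for choosing the SNR threshold, namely taking the median of the SNR values at which the genotyping accuracy peaks, can be written compactly. The sketch below only illustrates that rule; the function name and the accuracy values are made up, and the real grid runs from 0 to 1 in steps of 0.01.

```python
from statistics import median


def pick_snr_threshold(accuracy_by_snr):
    """Return the median of the SNR values that reach the peak accuracy.

    accuracy_by_snr: dict mapping a candidate SNR to the accuracy obtained with it.
    """
    peak = max(accuracy_by_snr.values())
    plateau = [snr for snr, acc in accuracy_by_snr.items() if acc == peak]
    return median(plateau)


# Hypothetical accuracies for one locus at a fixed SN threshold.
toy_grid = {0.10: 0.90, 0.20: 0.97, 0.30: 0.97, 0.40: 0.97, 0.50: 0.92}
toy_threshold = pick_snr_threshold(toy_grid)
```

In this toy grid the peak accuracy of 0.97 is reached for SNR values of 0.20, 0.30, and 0.40, so the selected threshold is 0.30. Choosing the middle of the plateau rather than its edge leaves a margin on both sides.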

Evaluation of NASTRA on ForenSeq amplicons with R9.4.1 flow cells

Having determined the optimal parameters for NASTRA, we first compared the performance of NASTRA with that of a typical alignment-based method, STRspy [27], on ForenSeq amplicon sequencing data (test set, sequencing duration of 24 hours). STRspy is an STR profiling tool tailored for long-read sequencing that is based on aligning sequencing reads to an allele reference library. Its authors have created a database containing allele sequences of 22 autosomal STRs. STRspy is thus unable to perform genotype calling for 5 out of 27 STR loci in the ForenSeq data because of the absence of allele information. This results in a calling ratio of 0 for these loci (D17S1301, D19S433, D20S482, D4S2408, D9S1122, and D6S1043) across all samples. Furthermore, it is crucial to emphasize that if the relevant allele sequence is not included in the reference database, STRspy may produce incorrect genotyping results due to wrong alignments.

We analyzed the calling ratio and genotyping accuracy for the ForenSeq data across the test set (Figure 3a-b). In addition, the genotyping results of sex-chromosomal STRs were summarized. NASTRA achieved a calling ratio of 1 for 7 STR loci, while for 9 STR loci the calling ratio ranged from 92.1% to 97.4%. Additionally, 3 STR loci had a calling ratio ranging from 71.8% to 80.0%. The calling ratios for PentaD and PentaE were the lowest, at 22.2% and 33.3%, respectively. This is attributed to homopolymer sequencing errors causing low SN for candidate alleles. STRspy successfully genotyped 22 STR loci, achieving a calling ratio of 1. NASTRA demonstrated superior accuracy compared to STRspy. NASTRA achieved 100% genotyping accuracy across 21 STR loci, including PentaE. Among the remaining 6 loci, the genotyping accuracy for 3 loci (D2S1338, D1S1656, and D6S1043) exceeded 92%. STRspy had 16 loci with a genotyping accuracy above 92%, including 12 loci with an accuracy of 100%.

Evaluation of NASTRA on PowerSeq amplicons with R10.3 flow cells

To investigate the performance of NASTRA with different amplification systems and flow cells, we conducted a sequencing experiment on 46 samples using Promega’s PowerSeq 46GY kit and R10.3 flow cells. The PowerSeq kit is a co-amplification system consisting of 22 autosomal STR loci and 23 Y-STR loci. The ONT R10.3 flow cell is a newer version featuring a new pore design, resulting in enhanced read accuracy and quality. We conducted STR genotyping on the sequencing data using STRspy v1.1.1 and NASTRA with default parameters. The calling ratio and genotyping accuracy for the PowerSeq data across 46 samples are shown in Figure 4a and b. Compared to the ForenSeq amplification system, the PowerSeq system does not include five autosomal STR loci (D17S1301, D20S482, D6S1043, and D9S1122). For NASTRA, 20 of the 22 autosomal STR loci were called with a calling ratio of 1, while the remaining 2 loci, PentaE (94.7%) and PentaD (55.6%), had lower calling ratios. For STRspy, only one locus, D19S433, is not included in the reference database. In terms of genotyping accuracy, we observed a decrease in performance for both STRspy and NASTRA on the PowerSeq dataset, which could be attributed to the overall shorter amplicons in the PowerSeq data compared to the ForenSeq data. Specifically, NASTRA achieved an accuracy exceeding 95% in 18 loci (compared to 13 loci for STRspy), with the remaining 4 loci having accuracies ranging from 80% to 91.5%. STRspy achieved accuracies between 61.7% and 91.3% in its remaining 7 loci, and there was no result for one locus, D19S433. In general, NASTRA exhibited a slightly better performance than STRspy.

Real-time typing of 8 human cell lines

To assess NASTRA's performance in real-world scenarios, DNA samples from eight human cell lines were amplified with the ForenSeq Signature kit and sequenced on a MinION platform with an R9.4.1 flow cell. We then used NanoTime to generate real-time sequencing data at various durations (1-6, 8, 10, 12, 16, 20, and 24 hours) and used NASTRA for the genotype calling of 27 autosomal STRs. The MinION platform yielded amplicon sequencing data covering 27 autosomal STRs, 24 Y-STRs, 7 X-STRs, and 94 identity SNPs, so the non-autosomal markers consumed a portion of the sequencing throughput.

An overview of the genotyping results is shown in Figure 5a. NASTRA consistently returned correct genotyping results for all STR loci, except those that failed UAS quality control and were flagged as unreliable (labeled Imbalance/Interpretation). With a sequencing duration of 1 hour, 8 of 27 loci were genotyped successfully and correctly across the 8 samples. Once the sequencing duration reached 4 hours, all loci belonging to the CODIS Core Loci were genotyped correctly in at least 7 of 8 samples. We found that the occurrence of unreliable genotyping results was related to low sequencing depth (Figure 5b-c). When the depth of each locus is higher than 300, very few unreliable genotyping results occur. This may be crucial for the development of STR multiplex panels tailored for nanopore sequencing.

Comparison of running time

NASTRA is written in Python and incorporates the parasail library [29], a C-based pairwise alignment tool. Users can use the shell command xargs to parallelize NASTRA at the sample level, while STRspy uses the shell tool parallel for parallelization at the locus level.

To compare runtimes, we allocated 8 threads to NASTRA, matching the thread count used by STRspy. STR genotype calling for all sequencing data was performed on a server with a 64-core/128-thread processor (AMD EPYC 7763, 2.45 GHz) and 256 GB of memory. Table 1 summarizes the runtime comparison between NASTRA and STRspy. NASTRA is about 60X faster than STRspy on the ForenSeq amplicon sequencing data, while on the PowerSeq amplicon sequencing data, STRspy took longer. This could potentially be attributed to the fact that the ForenSeq data encompass both 58 STRs and 94 SNPs, whereas the PowerSeq data include only 46 autosomal and Y-chromosomal STRs. Consequently, the number of reads processed from the PowerSeq data is higher than that from the ForenSeq data. Notably, in the real-time typing of 8 human cell lines, NASTRA proved suitable for handling such tasks.

While it's true that the two tools employ different programming languages, and thus, a direct comparison of runtime might not be entirely objective, it is evident that NASTRA significantly outperforms STRspy in terms of runtime. Based on our observations, we noted that a significant portion of time is occupied by the alignment process during the execution of STRspy, potentially attributed to the individual execution of Minimap2 [30,31] for each locus.

Discussion

In this study, we developed a novel computational approach for STR genotyping, named NASTRA. NASTRA comprises two key algorithms: read clustering and repeat structure inference. Recognizing that sequencing errors can occur randomly in each amplicon read, NASTRA uses read clustering to mitigate the impact of these errors and find the candidate allele sequences of the STR. Once the candidate allele sequences are obtained, the structure-aware algorithm reconstructs the repeat structures of the alleles based on the known repeat motifs from the fact sheets. This process transforms the allele sequences into bracket sequences, in which incomplete units or SNPs are readily discernible. Notably, in contrast to alignment-based methods, NASTRA operates independently of an allele reference database and thus will not fail to genotype an STR because of missing allele information in the database. Additionally, we applied NASTRA to MiSeq FGx data for ground truth construction, demonstrating that it can also be used for NGS data.
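To make the read-clustering idea concrete, the following sketch groups unique trimmed reads around the most frequent sequences. It is a simplified stand-in, not the method used by NASTRA: NASTRA clusters reads with parasail pairwise alignment under an affine-gap penalty, whereas this example substitutes a plain unit-cost edit distance and a greedy assignment. The function names and the tolerance of one edit are assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def cluster_reads(trimmed_reads, tolerance=1):
    """Greedily merge rare sequences into more frequent, near-identical ones.

    Returns a dict mapping each cluster representative to its SN.
    """
    clusters = {}
    # Visit unique sequences from most to least frequent.
    for seq, count in Counter(trimmed_reads).most_common():
        for representative in clusters:
            if edit_distance(seq, representative) <= tolerance:
                clusters[representative] += count
                break
        else:
            clusters[seq] = count  # no close representative: start a new cluster
    return clusters


# Toy locus: two true alleles (5 and 7 repeats of TCTA) plus erroneous reads.
demo_reads = (
    ["TCTA" * 5] * 10
    + ["TCTA" * 4 + "TCA"] * 2     # one-base deletion
    + ["TCTA" * 7] * 8
    + ["TCTA" * 6 + "TCTT"] * 1    # one-base substitution
)
demo_clusters = cluster_reads(demo_reads)
```

In this toy example the four distinct sequences collapse into two candidate alleles with SN values of 12 and 9. The same tolerance also shows how two genuine alleles that differ by a single base could be merged.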

To comprehensively evaluate the performance of NASTRA, we carried out several assessment procedures. We first created a ground truth dataset of 88 DNA samples using the MiSeq FGx system, which is a widely accepted technology for forensic applications. Then, amplification and sequencing experiments were carried out across these samples using various amplification kits and flow cells. To investigate the optimal thresholds for NASTRA and its robustness under different sequencing durations, we expanded the training and test sets using NanoTime. Our performance evaluation encompassed ForenSeq data, PowerSeq data, and a real-world test involving eight standard DNA cell lines. For performance comparison, we exclusively utilized the alignment-based STRspy and did not employ well-known repeat quantification tools such as repeatHMM [21], STRique [22], and DeepRepeat [20], as they are primarily tailored for detecting repeat expansion disorders rather than forensic-level STR genotyping. As demonstrated in our previous study [23], repeatHMM didn’t perform well. However, DeepRepeat presents an intriguing possibility for utilizing current signals in STR genotyping. Exploring transfer learning with our data could be a promising avenue. In summary, the results demonstrated that NASTRA exhibited exceptional performance in terms of accuracy and speed, making it a promising method for real-world applications.

However, NASTRA does have several limitations. First, our performance comparison was conducted using only one alignment-based method, and the reference database for this method is still being updated. Therefore, this evaluation may not provide a fully objective assessment but rather serves as an illustrative reference for accuracy and runtime. Second, NASTRA is unable to genotype incomplete repeats. While we have demonstrated that incomplete repeat units can be detected in the output allele, NASTRA does not incorporate this numerical information into its allele calls. Third, NASTRA’s read clustering carries the risk of merging two alleles into a single allele. This may occur because the clustering tolerates a single nucleotide polymorphism (SNP) or a one-base gap, so two similar alleles that differ by just one base can be merged. Fourth, our testing with ForenSeq and PowerSeq data shows that amplicon length can affect performance. Specifically, the flanking regions of these STRs, including their prefix and suffix, can be quite similar, potentially leading to incorrect trimming. NASTRA requires amplicons with long flanking regions (more than 30 bp) to ensure accurate STR genotyping. Finally, NASTRA relies on base sequences as input, and its performance therefore depends on the basecalling model. Given that STRs are complex genomic regions, an effective basecalling model for STRs needs to be developed.

Conclusions

NASTRA is an innovative computational approach for STR genotyping using nanopore sequencing data, employing a structure-aware algorithm instead of traditional alignment to a reference database. In tests and comparisons conducted on various flow cells and amplification kits, NASTRA demonstrated good performance in terms of genotyping accuracy and speed. Although not all STR loci obtained correct genotype calls, NASTRA exhibited robustness at specific loci, demonstrating its significant potential for human identification and cell line authentication using nanopore sequencing platforms. Further improvements of NASTRA through validation studies with larger sample sizes are necessary.

Methods

Sample collection and DNA extraction

A total of 76 blood samples were collected from 76 unrelated, anonymized Han Chinese volunteers, including 38 females (ID: F01-F38) and 38 males (ID: M01-M38). Among them, 30 samples (F01-F15 and M01-M15) were used in our previous study [23]. The remaining 46 samples (F16-F38 and M16-M38) were newly collected. Genomic DNA of the 46 newly collected samples was extracted using the PureLink genomic DNA kit (Invitrogen, USA). To perform real-time scenario testing, we collected 8 control DNA samples, including 3 forensic DNA controls (2800M from Promega, 9947A and 9948 from Origene) and 5 cell line standards (NA12878, NA24143, NA24149, NA24694, and NA24695 from the Coriell Institute). All DNA samples were quantified on the Qubit 3.0 Fluorometer (Invitrogen) before amplification.

STR profiling using MiSeq FGx system

Newly collected DNA samples (F16-F30, M16-M30) and the 8 control DNA samples were amplified using the ForenSeq DNA Signature Prep Kit (DNA Primer Mix A). In addition, 2800M was amplified as a positive control using DNA Primer Mix B, which covers all STR loci in the Mix A kit. To obtain sufficient amplicons for both Illumina MiSeq FGx sequencing and nanopore sequencing, 50 ng of template, rather than the recommended 1 ng, was used as input. PCR amplification was performed on the ProFlex™ 3x32-Well PCR System (Thermo Fisher).

All purified amplified libraries were divided into 2 shares, which were used for MiSeq FGx sequencing and nanopore sequencing, respectively. Pooled libraries were obtained by combining equal volumes (2 μL) of normalized library (10 nmol) and were diluted to 4 nmol. Finally, 7 pmol of pooled libraries were sequenced on the MiSeq FGx desktop sequencer. The above experimental steps were carried out according to the manufacturer’s protocol. Genotype calling of STR loci was performed using the ForenSeq Universal Analysis Software, UAS (v1.3.6897; Verogen), with Verogen’s default analytical threshold (1.5%), interpretation threshold (4.5%), and stutter filter (specific to each locus).

Nanopore sequencing for ForenSeq amplicons

The concentration of PCR products was quantified with the Qubit 3.0 fluorometer and the Qubit dsDNA HS assay kit according to the manufacturer’s instructions before library preparation. Nanopore libraries were prepared with the Ligation Sequencing Kit (SQK-LSK109). Briefly, 0.2 pmol of purified DNA per sample was processed using the NEBNext Ultra II End Repair/dA-Tailing Module (E7546). The samples were multiplexed using Native Barcoding Expansion 1-12 (PCR-free) and Native Barcoding Expansion 13-24 (PCR-free). Adapters were ligated to the pooled libraries with the NEBNext Quick Ligation Module (E6056). Finally, 0.05 pmol of library was loaded into the R9.4.1 flow cell, and sequencing was performed with MinKNOW. In total, 48 DNA samples and 8 control DNA samples were sequenced in 5 runs with a sequencing duration beyond 24 hours.

Nanopore sequencing for PowerSeq amplicons

According to the manufacturer’s instructions, 48 DNA samples were selected and amplified using the PowerSeq™ 46GY System (Promega, WI, USA). For each sample, 1 ng of template DNA was used for PCR amplification with 29 cycles. The PCR products were cleaned up with AMPure XP beads (Beckman Coulter, Indianapolis, USA), and the DNA concentration was measured. Library preparation was the same as described above. Finally, 0.05 pmol of library was loaded into the R10.3 flow cell, and sequencing was performed using MinKNOW. In total, 48 DNA samples were sequenced in 2 runs with a sequencing duration beyond 24 hours.

Bioinformatic analysis for nanopore sequencing data

We performed basecalling using Guppy v6.3.4 with the high-accuracy model. Reads were then aligned to the human reference genome (assembly GRCh37, hg19) using Minimap2 v2.17-r941 [30,31]. Finally, SAMtools v1.6 [32] was employed to convert SAM files into BAM format, which was used as the input of NASTRA. Downsampled data were generated using NanoTime, which is available at https://github.com/renzilin/NanoTime.
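The alignment and BAM conversion steps could be scripted roughly as in the sketch below. It only builds the command lines and does not run them. The `map-ont` preset, the sort and index steps, the function name and all file names are assumptions for illustration; the text states only the tool versions and that SAM files were converted to BAM.

```python
import shlex
from pathlib import Path


def build_alignment_commands(reference, fastq, out_dir, preset="map-ont"):
    """Return command lists for alignment, BAM conversion and indexing.

    The preset and the sort/index steps are assumptions, not taken from
    the paper.
    """
    out_dir = Path(out_dir)
    sam = out_dir / (Path(fastq).stem + ".sam")
    bam = sam.with_suffix(".bam")
    return [
        # minimap2: -a writes SAM, -x selects a preset, -o names the output
        ["minimap2", "-a", "-x", preset, "-o", str(sam), str(reference), str(fastq)],
        # samtools sort writes BAM by default
        ["samtools", "sort", "-o", str(bam), str(sam)],
        ["samtools", "index", str(bam)],
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the commands as a dry run.
    for cmd in build_alignment_commands("hg19.fa", "barcode01.fastq", "aligned"):
        print(shlex.join(cmd))
```

To execute the steps, each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the output directory exists.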

Pairwise alignment with affine-gap penalty

We used parasail-python [29] (https://github.com/jeffdaily/parasail-python) to perform pairwise alignment with affine gap penalties in the read trimming and clustering steps. Because gap open and extension penalties are required to be positive integers, we specified a match score of 75, a mismatch penalty of 90, a gap open penalty of 75, and a gap extension penalty of 10. Consequently, the alignment prefers up to 2 consecutive gaps over 1 mismatch. If the number of consecutive gaps exceeds 2, the alignment prefers 1 mismatch.
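The trade-off between gaps and mismatches can be checked with a few lines of arithmetic. The sketch below assumes that a run of k consecutive gaps costs the gap open penalty once plus the extension penalty for each additional gap, which is consistent with the behavior described above. The function name is hypothetical and the code does not call parasail.

```python
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 75, 90, 75, 10


def gap_run_penalty(length, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Penalty of one run of `length` consecutive gaps.

    Assumed convention: the first gap pays `gap_open`, and every
    additional gap in the same run pays `gap_extend`.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    for k in (1, 2, 3):
        cost = gap_run_penalty(k)
        cheaper = "gap run" if cost < MISMATCH else "mismatch"
        print(f"{k} consecutive gap(s): penalty {cost}; cheaper option: {cheaper}")
```

Under this assumption a run of 2 gaps costs 85, which is below the mismatch penalty of 90, while a run of 3 gaps costs 95, which is above it.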

Recursive algorithm for repeat structure inference

To obtain the number of repeat units and finish the allele call without alignment, we developed a recursive algorithm that infers the repeat structure by searching for repeat motifs in the allele sequence according to the STRBase fact sheets. The algorithm works as follows: 1. for each motif, exact matching is used to find the segments consisting of a single type of motif; 2. one segment is masked at a time and step 1 is repeated; 3. duplicated masked reads are removed and step 2 is repeated until no new segment is found; 4. if the masked reads converge to a single one, this is taken as the optimal inference of the repeat structure. A schematic representation of the algorithm is shown in Figure 6.
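One possible reading of the four steps is sketched below. It is not the NASTRA implementation. The mask character, the function names, the bracketed output format and the choice to return `None` when the masked reads do not converge are all assumptions, and the motif list is supplied by the caller instead of being read from STRBase.

```python
import re

MASK = "#"  # placeholder for bases already assigned to a repeat segment


def find_runs(seq, motifs):
    """Return (start, end, motif) for every exact run of a single motif."""
    runs = []
    for motif in motifs:
        for m in re.finditer(f"(?:{re.escape(motif)})+", seq):
            runs.append((m.start(), m.end(), motif))
    return runs


def infer_structure(allele, motifs):
    """Mask one run at a time until no run is left.

    Returns a bracketed structure string if every masking order ends in
    the same masked read, otherwise None.
    """
    frontier = {(allele, frozenset())}  # states: (masked read, segments)
    finished = set()
    while frontier:
        next_frontier = set()  # using a set drops duplicated masked reads
        for seq, segments in frontier:
            runs = find_runs(seq, motifs)
            if not runs:
                finished.add((seq, segments))
            for start, end, motif in runs:
                masked = seq[:start] + MASK * (end - start) + seq[end:]
                next_frontier.add((masked, segments | {(start, end, motif)}))
        frontier = next_frontier
    if len(finished) != 1:
        return None  # the masked reads did not converge
    _, segments = next(iter(finished))
    parts, pos = [], 0
    for start, end, motif in sorted(segments):
        if start > pos:
            parts.append(allele[pos:start])  # bases outside any repeat run
        parts.append(f"[{motif}]{(end - start) // len(motif)}")
        pos = end
    if pos < len(allele):
        parts.append(allele[pos:])
    return " ".join(parts)


if __name__ == "__main__":
    # Toy allele and motifs chosen for illustration only.
    print(infer_structure("TCTATCTATCTGTCTGTATCTA", ["TCTA", "TCTG"]))
```

For the toy allele the sketch reports `[TCTA]2 [TCTG]2 TA [TCTA]1`. With overlapping motifs such as `AGAT` and `GATA` on `AGATAGATA`, different masking orders end in different masked reads and the function returns `None`. The number of intermediate states can grow quickly with the number of runs, so this form is only suitable for short allele sequences.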

Declarations

Ethics approval and consent to participate

Our study was approved by the ethics committee of Shanxi Medical University (no. 2020GLL031). Written informed consent was obtained from all participants and all authors.

Consent for publication

Written informed consent was obtained from the participants and all authors for publication of this case report.

Availability of data and materials

NanoTime is available on GitHub (https://github.com/renzilin/NanoTime). NASTRA is released under the GPL v3.0 license and is publicly available on GitHub (https://github.com/renzilin/NASTRA).

Competing interests

Not applicable.

Funding

Not applicable.

Authors' contributions

ZR developed the computational method, conducted all data analysis and drafted the manuscript. JZ and TY conducted all wet laboratory experiments, including DNA extraction, amplification, and MiSeq FGx and Nanopore sequencing. YZ tested the computational method. PS and JX revised the manuscript. JY provided DNA samples and guidance on the study. MN offered advice and guidance on the study, and revised the manuscript. All authors approved the manuscript.

Acknowledgements

The authors would like to thank Dr. JinDing Liu, Dr. Fenglong Yang, and Dr. Juan Jia from Shanxi Medical University and Dr. Xu Liu for their valuable comments on this study and their support.

References

beyond human identification: Implications for development of new DNA typing systems. ELECTROPHORESIS. 1999;20:1682–96.

2. Butler JM. Genetics and genomics of core short tandem repeat loci used in human identity testing. J Forensic Sci. 2006;51:253–65.

profiling provides an international reference standard for human cell lines. Proceedings of the National Academy of Sciences. 2001;98:8012–7.

4. Dirks WG, Faehnrich S, Estella IAJ, Drexler HG. Short tandem repeat DNA typing provides an international reference standard for authentication of human cell lines. ALTEX. 2005;22:103–9.

5. Matsuo Y, Nishizaki C, Drexler HG. Efficient DNA fingerprinting method for the identification of cross-culture contamination of cell lines. Hum Cell. 1999;12:149–54.

relationship tests that show ambiguous STR results using autosomal SNPs as supplementary

cross-contamination initiative: an interactive reference database of STR profiles covering common cancer cell lines. Int J Cancer. 2010;126:303–4.

structure and recent improvements towards molecular authentication of human cell lines.

of Microbial Pathogens in the Little Bighorn River, Montana. International Journal of Environmental Research and Public Health. 2019;16:1097.

Portable Genomics for Early Detection of Plant Viruses and Pests in Sub-Saharan Africa. Genes. 2019;10:632.

quantification of short tandem repeats on signal data from nanopore sequencing. Genome Biol. 2022;23:108.

short tandem repeat expansions and their methylation state with nanopore sequencing. Nat

STRs and SNPs using Verogen’s ForenSeq DNA Signature Prep Kit and MinION. Int J Legal Med. 2021;135:1685–93.

Sequencing of a Forensic STR Multiplex Reveals Loci Suitable for Single-Contributor STR Profiling. Genes (Basel). 2020;11:381.

repeat identification using a nanopore-based DNA sequencer: a pilot study. J Hum Genet. 2020;65:21–4.

forensic autosomal STRs using the Oxford Nanopore Technologies MinION device. Forensic Sci Int Genet. 2022;56:102629.

31. Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021;37:4572–4.

32. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008.

individual samples (88 DNA samples) with different amplification methods and flow cells. With ForenSeq data and PowerSeq data, we optimized the parameters of NASTRA, and conducted performance evaluation and real-time scenario testing. b. The workflow of NASTRA: 1. Aligned reads are extracted from the BAM file; 2. Flanking regions of each read are trimmed; 3. Read clustering is performed to find candidate allele sequences; 4. The repeat structure of each allele is inferred by the recursive inference algorithm; 5. Genotype calling.

each locus when supporting reads number (SN) is fixed. b. The genotype calling accuracy of NASTRA on each locus with different ratios of supporting reads number (SNR), when SN is 25. c. The distribution of genotyping results across downsampled data. d. The BLAST identity distribution of alleles between NASTRA and truth.

NASTRA and STRspy on ForenSeq data for 27 STR loci. b. The deviation of genotypes between NASTRA and ground truth. c. The correlation of SN and sequencing depth (blue points represent other loci). d. The difference of sequencing depth among genotyping results.

results across 46 samples on 22 autosomal STRs sequenced by R10.3 flow cells. b. Accuracy comparison of NASTRA and STRspy on PowerSeq data for 22 STR loci. For D19S433, the allele information is missing in the STRspy custom database. c. The deviation of genotypes between NASTRA and ground truth.

Figure 5. Performance on real-time scenario testing. a. Overview of genotyping results obtained from 8 DNA standard samples across 27 autosomal STRs, with variations in sequencing durations, using an R9.4.1 flow cell. b. The percentages of genotyping results achieving exact match, incomplete match or imbalance/interpretation across all 8 samples, with different sequencing durations. c. The distribution of sequencing depth in relation to different sequencing durations.

8 human cell lines

Number of samples 16 5,013,171 1348.7

NASTRA STRspy (sec) 50.6

1846.1 1805.6 1133.7 858.9 1026.9 1281.1 50.6 62.1 44.8 37.1 41.2 48.4 39.0