Biology Department/Institute of Genomics and Evolutionary Medicine (iGEM), Temple University, (SERC - 645), 1925 N. 12 St, Philadelphia, PA, 19122-1801, USA.
University of Pennsylvania, Wildlife Futures Program, Kennett Square, Philadelphia, PA, 19348, USA.
Malar J. 2024 May 4;23(1):134. doi: 10.1186/s12936-024-04961-8.
Studies on haemosporidian diversity, including the origin of human malaria parasites, the zoonotic dynamics of malaria, and regional biodiversity patterns, have used target gene approaches. However, current methods involve a trade-off between scalability and data quality. Here, a long-read Next-Generation Sequencing protocol using PacBio HiFi is presented. Data processing is supported by a pipeline that uses machine learning to analyse the reads.
A set of primers was designed to target approximately 6 kb, almost the entire length of the haemosporidian mitochondrial genome. Amplicons from different samples were multiplexed in a SMRTbell® library preparation. A pipeline (HmtG-PacBio Pipeline) to process the reads is also provided. It integrates multiple sequence alignments, a machine-learning algorithm that uses modified variational autoencoders, and a clustering method to identify the mitochondrial haplotypes/species in a sample. Although 192 specimens could be studied simultaneously, a pilot experiment with 15 specimens is presented, including in silico experiments in which multiple data combinations were tested.
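The idea of grouping aligned reads into haplotypes can be illustrated with a minimal sketch. It is not the HmtG-PacBio Pipeline: the variational autoencoder step is replaced here by a plain greedy Hamming-distance clustering, and the function names, the `max_dist` threshold, and the `min_reads` cutoff are all hypothetical. It assumes the reads have already been aligned to equal length.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two aligned reads."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_dist):
    """Greedy clustering (stand-in for the pipeline's VAE + clustering):
    a read joins the first cluster whose seed is within max_dist mismatches."""
    clusters = []  # list of (seed_read, member_reads)
    for read in reads:
        for seed, members in clusters:
            if hamming(seed, read) <= max_dist:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return [members for _, members in clusters]


def consensus(members):
    """Column-wise majority base across the reads of one cluster."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))


def call_haplotypes(reads, max_dist=1, min_reads=2):
    """Return sorted consensus sequences of clusters with enough supporting reads."""
    return sorted(
        consensus(m) for m in cluster_reads(reads, max_dist) if len(m) >= min_reads
    )


# Toy mixed infection: two haplotypes, one read with a sequencing error,
# and one unsupported singleton that is discarded.
toy_reads = ["ACGTACGT"] * 3 + ["ACGTACGA"] + ["TTGTACCC"] * 3 + ["GGGGGGGG"]
haplotypes = call_haplotypes(toy_reads)
```

In this toy input the read carrying a single error is absorbed into its parent cluster and corrected by the consensus, while the singleton falls below `min_reads` and is dropped.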
The primers amplified various haemosporidian parasite genomes and yielded high-quality mt genome sequences. This new protocol allowed the detection and characterization of mixed infections and co-infections in the samples. The machine-learning approach converged on reproducible haplotypes with a low error rate, averaging 0.2% per read (minimum of 0.03% and maximum of 0.46%). The minimum recommended coverage per haplotype is 30X based on the detected error rates. The pipeline facilitates inspecting the data, including a local BLAST search against a provided file of mitochondrial sequences that the researcher can customize.
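To give a feel for how read depth and error rate interact, the sketch below computes the chance that a majority-vote consensus is wrong at one position. It is an illustration only, not the calculation behind the 30X recommendation, which the abstract does not describe. It assumes that the reported rate can be read as an independent per-base error probability and, as a worst case, that all erroneous reads agree on the same wrong base; the function name is hypothetical.

```python
from math import comb


def consensus_error_prob(depth, per_base_error):
    """Upper-bound probability that a strict majority of `depth` reads
    is wrong at one position, assuming independent errors (binomial tail)."""
    need = depth // 2 + 1  # reads required for a strict majority
    return sum(
        comb(depth, k) * per_base_error**k * (1 - per_base_error) ** (depth - k)
        for k in range(need, depth + 1)
    )


# Highest reported per-read error rate (0.46%) at several depths.
worst_rate = 0.0046
for depth in (1, 5, 30):
    print(depth, consensus_error_prob(depth, worst_rate))
```

Under these assumptions, the per-position consensus error drops steeply as depth grows, which is consistent with a modest minimum coverage being sufficient for haplotypes supported by HiFi reads.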
This is not a diagnostic approach but a high-throughput method to study haemosporidian sequence assemblages and perform genotyping by targeting the mitochondrial genome. Accordingly, the methodology allowed the examination of specimens with multiple infections and co-infections of different haemosporidian parasites. The pipeline enables data quality assessment and comparison of the haplotypes obtained with those from previous studies. Although this is a single-locus approach, whole mitochondrial genome data provide high-quality information to characterize species pools of haemosporidian parasites.