Effective primer design for genotype and subtype detection of highly divergent viruses in large scale genome datasets.

Author information

Demiralay Burak, Can Tolga

Affiliations

Department of Health Informatics, Informatics Institute, Middle East Technical University, Dumlupınar Bulvarı No 1, 06800, Çankaya, Ankara, Turkey.

Department of Computer Science, Colorado School of Mines, 1501 Illinois St, Golden, 80401, CO, USA.

Publication information

BMC Bioinformatics. 2025 Sep 1;26(1):223. doi: 10.1186/s12859-025-06251-9.

Abstract

Identification of microorganisms in a biological sample is a crucial step in diagnostics, pathogen screening, biomedical research, evolutionary studies, agriculture, and biological threat assessment. While progress has been made in studying larger organisms, there is a need for an efficient and scalable method that can handle thousands of whole genomes for organisms with high mutation rates and genetic diversity, such as single-stranded viruses. In this study, we developed a novel method to identify subsequences for detecting a given species or subspecies in a (meta)genomic sample using the Polymerase Chain Reaction (PCR). Species detection in any analysis depends heavily on the measurement method, and because thermodynamic interactions are critical in PCR, thermodynamics is the main driving force of the proposed methodology. Our method is parallelized in multiple steps and involves extracting all oligonucleotides from the target genomes. We then locate the target sites for each oligonucleotide using a suffix array constructed for this purpose and local alignment, followed by thermodynamic interaction assessment. An important requirement for subspecies identification is to avoid amplifying a non-target set of genomes, and our method addresses this. We applied our method to three highly divergent viruses: (1) Hepatitis C virus (HCV), whose subtypes differ in 31-33% of nucleotide sites on average; (2) Human immunodeficiency virus (HIV), for which 25-35% between-subtype and 15-20% within-subtype variation is observed; and (3) the Dengue virus, whose genomes (DENV 1-4 only) share 60% sequence identity with each other. Using our method, we were able to select oligonucleotides that can identify in silico 99.9% of 1657 HCV genomes, 99.7% of 11,838 HIV genomes, and 95.4% of 4016 Dengue genomes. We also show subspecies identification on genotypes 1-6 of HCV and genotypes 1-4 of the Dengue virus with a true positive rate above 99.5% and a false positive rate below 0.05%, on average. None of the state-of-the-art methods can produce oligonucleotides with this specificity and sensitivity on highly divergent viral genomes like the ones studied in this article.
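
The selection idea described in the abstract (enumerate candidate oligonucleotides from the target genomes, locate their sites with a suffix array, and reject candidates that also hit non-target genomes) can be illustrated with a toy Python sketch. This is not the authors' implementation. It assumes exact string matching only, so the local alignment and thermodynamic assessment steps are not modeled, and all function names, parameters, and sequences below are hypothetical.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted lexicographically.

    Naive construction (assumption: toy inputs only); real genome sets
    would need a linear-time or library-based construction.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions of exact matches of pattern in text."""
    m = len(pattern)

    def first_index(strictly_greater):
        # Binary search over suffixes, comparing only their first m characters.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return sorted(suffix_array[first_index(False):first_index(True)])


def coverage(oligo, indexed_genomes):
    """Fraction of genomes containing at least one exact site for oligo."""
    if not indexed_genomes:
        return 0.0
    hits = sum(1 for text, sa in indexed_genomes if find_occurrences(text, sa, oligo))
    return hits / len(indexed_genomes)


def candidate_oligos(genomes, k):
    """All distinct length-k subsequences found in the given genomes."""
    return {g[i:i + k] for g in genomes for i in range(len(g) - k + 1)}


def select_oligos(targets, non_targets, k, min_sensitivity=1.0, max_false_positive=0.0):
    """Keep oligos that cover enough targets and few enough non-targets.

    Returns a dict mapping oligo -> (sensitivity, false_positive_rate).
    The thresholds are illustrative defaults, not values from the paper.
    """
    target_index = [(g, build_suffix_array(g)) for g in targets]
    non_target_index = [(g, build_suffix_array(g)) for g in non_targets]
    selected = {}
    for oligo in sorted(candidate_oligos(targets, k)):
        sensitivity = coverage(oligo, target_index)
        false_positive = coverage(oligo, non_target_index)
        if sensitivity >= min_sensitivity and false_positive <= max_false_positive:
            selected[oligo] = (sensitivity, false_positive)
    return selected


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    targets = ["ACGTACGTGG", "TTACGTACGA"]
    non_targets = ["GGGGCCCCAA"]
    for oligo, (sens, fpr) in select_oligos(targets, non_targets, k=6).items():
        print(oligo, sens, fpr)
```

Exact matching is the weakest part of this sketch. For highly divergent viruses, a primer can still bind a site with a few mismatches, which is why the abstract pairs the suffix-array lookup with local alignment and a thermodynamic check.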

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5d24/12400757/925ceb59cbc3/12859_2025_6251_Fig1_HTML.jpg
