Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun, Jilin Province, China.
Jiangsu Co-innovation Center for Prevention and Control of Important Animal Infectious Diseases and Zoonosis, Yangzhou University, Yangzhou, Jiangsu Province, China.
mSystems. 2022 Dec 20;7(6):e0090722. doi: 10.1128/msystems.00907-22. Epub 2022 Oct 26.
Foreign contaminant sequences are widespread in public databases and pose a substantial obstacle to genomic analyses. Contamination in viral genome databases is equally notorious but more complicated, and it often causes questionable results in various applications, particularly in virome-based virus detection. Here, we comprehensively screened the largest eukaryotic viral genome collections of GenBank and UniProt for hidden foreign sequences, using a scrutiny pipeline that rigorously detects problematic viral sequences (PVSs) originating from hosts, vectors, and laboratory components. In total, we identified 766 nucleotide PVSs and 276 amino acid PVSs with lengths up to 6,605 bp. They were widely distributed across 39 families, many of which include viruses of high public health concern, such as hepatitis C virus, Crimean-Congo hemorrhagic fever virus, and filoviruses. The majority of these PVSs are genomic fragments of hosts, including humans and bacteria. However, they cannot simply be regarded as foreign contaminants, since some of them result from natural occurrence or artificial engineering of viruses. Nevertheless, they severely disturb sequence-based analyses such as genome annotation, taxonomic assignment, and virome profiling. We therefore provide a clean version of the eukaryotic viral reference data set with these PVSs removed, which allows more accurate virome analysis in less time than other comprehensive databases.

High-throughput sequencing-based viromics depends heavily on reference databases, but foreign contamination is widespread in public databases and often leads to confusing or even wrong conclusions in genomic analysis and viromic profiling. To address this issue, we systematically detected and identified contamination in the largest viral sequence collections of GenBank and UniProt using a stringent scrutiny pipeline. We found hundreds of PVSs related to hosts, vectors, and laboratory components. Removing them yields a data set that greatly improves the accuracy and efficiency of eukaryotic virome profiling. These results refresh our knowledge of the types and origins of PVSs and serve as a warning for viromic analysis. Viromic practitioners should be aware of the problems caused by PVSs and recognize that a careful review of bioinformatic results is necessary for reliable conclusions.
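As a rough illustration of what removing PVSs from a reference set can look like in practice, the sketch below drops flagged records from a FASTA file by accession. It is not the authors' scrutiny pipeline: the accession identifiers, the `flagged_accessions` set, and the helper names `parse_fasta` and `drop_flagged` are all hypothetical. It also assumes that a flagged record is removed whole, whereas a PVS may be only a fragment of a longer viral sequence and could instead be masked.

```python
from typing import Iterable, Iterator, Set, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_flagged(
    records: Iterable[Tuple[str, str]], flagged: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose accession (first header token) is not flagged."""
    for header, seq in records:
        accession = header.split()[0]
        if accession not in flagged:
            yield header, seq


# Hypothetical reference entries; the accessions are made up for illustration.
example_fasta = """>ACC0001.1 hypothetical virus A
ATGCGT
ACGT
>ACC0002.1 hypothetical virus B (flagged as a PVS)
GGGCCC
>ACC0003.1 hypothetical virus C
TTTAAA
""".splitlines()

# Hypothetical list of accessions reported as PVSs by some screening step.
flagged_accessions = {"ACC0002.1"}

clean = list(drop_flagged(parse_fasta(example_fasta), flagged_accessions))
for header, seq in clean:
    print(f">{header}\n{seq}")
```

Filtering on the accession rather than on the full header keeps the step independent of free-text descriptions, which may differ between database releases.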