Goodacre Norman, Aljanahi Aisha, Nandakumar Subhiksha, Mikailov Mike, Khan Arifa S
Division of Viral Products, Office of Vaccines Research and Review, Center for Biologics Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA.
Division of Imaging, Diagnostics and Software Reliability, Office of Science & Engineering Laboratories, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland, USA.
mSphere. 2018 Mar 14;3(2). doi: 10.1128/mSphereDirect.00069-18. eCollection 2018 Mar-Apr.
Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because there is no public database that contains all viral sequences without also containing abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, with an overall reduction in cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was reviewed manually and computationally, resulting in refined semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% similarity by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, high sequence diversity, and reduced cellular content; it includes full-length and partial sequences as well as endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2 with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entire NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publicly available to facilitate HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank.
To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection.
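The semantic selection step described in the abstract (keeping records whose descriptions look viral while excluding bacterial viruses) can be illustrated with a minimal Python sketch. The keyword patterns, function names, and example records below are hypothetical and chosen only for illustration; they are not the published SEM-I or SEM-R criteria, and the sketch does not reproduce the clustering or BLAST/HMMER confirmation steps.

```python
import re

# Hypothetical keyword screens for illustration only; these are NOT the
# published SEM-I / SEM-R criteria used to build RVDB.
POSITIVE = re.compile(r"virus|viral|viroid|retrotransposon", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(bacterio)?phage\b", re.IGNORECASE)


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_viral_header(header):
    """Keep headers with a viral keyword and no bacteriophage keyword."""
    return bool(POSITIVE.search(header)) and not NEGATIVE.search(header)


def select_viral(records):
    """Return the (header, sequence) records that pass the keyword screen."""
    return [(h, s) for h, s in records if is_viral_header(h)]


# Made-up records with placeholder identifiers (not real accessions).
EXAMPLE = """>X1 Hepatitis B virus isolate, partial genome
ACGT
>X2 Homo sapiens actin mRNA
GGCC
>X3 Enterobacteria phage T4 virus-like segment
TTAA
>X4 Mus musculus endogenous retrovirus element
CCGG
"""

kept = select_viral(parse_fasta(EXAMPLE.splitlines()))
for header, seq in kept:
    print(header, len(seq))
```

A keyword screen of this kind is only a first pass: as the abstract notes, the authors reviewed the first-generation database manually and computationally to refine the criteria, and confirmed the viral identity of the representative sequences by similarity searches.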