Resch Wolfgang, Zaslavsky Leonid, Kiryutin Boris, Rozanov Michael, Bao Yiming, Tatusova Tatiana A
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
BMC Microbiol. 2009 Apr 2;9:65. doi: 10.1186/1471-2180-9-65.
There is an increasing number of complete and incomplete virus genome sequences available in public databases. This large body of sequence data harbors information about epidemiology, phylogeny, and virulence. Several specialized databases, such as the NCBI Influenza Virus Resource or the Los Alamos HIV database, offer sophisticated query interfaces along with integrated exploratory data analysis tools for individual virus species to facilitate extracting this information. Thus far, there has not been a comprehensive database for dengue virus, a significant public health threat.
We have created an integrated web resource for dengue virus. The technology developed for the NCBI Influenza Virus Resource has been extended to process non-segmented dengue virus genomes. To process the dengue genome efficiently, given that it is large compared with individual influenza segments, we developed an offline pre-alignment procedure that generates a multiple sequence alignment of all dengue sequences. The pre-calculated alignment is then used to rapidly create alignments of sequence subsets in response to user queries. This improvement in technology will also facilitate the incorporation of additional virus species in the future. The set of virus-specific databases at NCBI, referred to here as Virus Variation Resources (VVR), allows users to build complex queries against a virus-specific database and then apply exploratory data analysis tools to the results. The metadata are collected automatically where possible and extended with data extracted from the literature.
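The idea of reusing a pre-calculated alignment for a queried subset can be illustrated with a minimal sketch. It is not the resource's actual implementation, which the abstract does not describe. It assumes the master alignment is held as a mapping from sequence identifier to gapped string, and the function name `subset_alignment` and the identifiers `seqA`, `seqB`, `seqC` are hypothetical. The subset alignment is obtained by selecting the requested rows and dropping columns that contain only gaps in the selection, so no realignment is needed at query time.

```python
from typing import Dict, Iterable


def subset_alignment(
    master: Dict[str, str], ids: Iterable[str], gap: str = "-"
) -> Dict[str, str]:
    """Project a precomputed alignment onto the requested sequences.

    Assumption: every row in `master` has the same length (one master
    alignment covering all sequences).
    """
    rows = {seq_id: master[seq_id] for seq_id in ids}
    if not rows:
        return {}
    if len({len(row) for row in rows.values()}) != 1:
        raise ValueError("master alignment rows must have equal length")

    # Keep only the columns where at least one selected sequence has a residue.
    keep = [
        i
        for i, column in enumerate(zip(*rows.values()))
        if any(char != gap for char in column)
    ]
    return {
        seq_id: "".join(row[i] for i in keep) for seq_id, row in rows.items()
    }


# Toy master alignment (made-up sequences, for illustration only).
master = {
    "seqA": "ATG-CA--T",
    "seqB": "ATGGCA--T",
    "seqC": "ATG-CATTT",
}

subset = subset_alignment(master, ["seqA", "seqC"])
for seq_id, row in subset.items():
    print(seq_id, row)
```

The projected alignment is consistent with the master alignment but is not guaranteed to be the alignment that would result from aligning the subset on its own. The trade-off is speed: the cost per query is linear in the size of the selected rows.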
The NCBI Dengue Virus Resource integrates dengue sequence information with relevant metadata (sample collection time and location, disease severity, serotype, sequenced genome region) and facilitates retrieval and preliminary analysis of dengue sequences using integrated web analysis and visualization tools.