Department of Health and Human Services, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA.
Bioinformatics. 2018 Mar 1;34(5):755-759. doi: 10.1093/bioinformatics/btx669.
Nucleic acid sequences in public databases should not contain vector contamination, but many sequences in GenBank do (or did) contain vectors. The National Center for Biotechnology Information uses the program VecScreen to screen submitted sequences for contamination. Additional tools are needed to distinguish true-positive (contamination) from false-positive (not contamination) VecScreen matches.
A principal reason for false-positive VecScreen matches is that the sequence and the matching vector subsequence originate from closely related or identical organisms (for example, both originate in Escherichia coli). We collected information on the taxonomy of sources of vector segments in the UniVec database used by VecScreen. We used that information in two overlapping software pipelines for retrospective analysis of contamination in GenBank and for prospective analysis of contamination in new sequence submissions. Using the retrospective pipeline, we identified and corrected over 8000 contaminated sequences in the nonredundant nucleotide database. The prospective analysis pipeline has been in production use since April 2017 to evaluate some new GenBank submissions.
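The taxonomy-based idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the lineage lists, and the `min_shared_depth` threshold are all hypothetical, chosen only to show how the relatedness of the query organism and the vector segment's source organism could be used to flag a VecScreen match as a likely false positive.

```python
from typing import Sequence

def shared_lineage_depth(a: Sequence[str], b: Sequence[str]) -> int:
    """Count how many leading taxonomic ranks two root-to-leaf lineages share."""
    depth = 0
    for rank_a, rank_b in zip(a, b):
        if rank_a != rank_b:
            break
        depth += 1
    return depth

def classify_match(
    query_lineage: Sequence[str],
    vector_source_lineage: Sequence[str],
    min_shared_depth: int = 6,  # hypothetical cutoff, not taken from the paper
) -> str:
    """Label a vector match using only the taxonomy of the two sources.

    If the submitted sequence and the matching vector segment come from
    closely related organisms, the match is more plausibly native sequence
    than contamination.
    """
    depth = shared_lineage_depth(query_lineage, vector_source_lineage)
    if depth >= min_shared_depth:
        return "likely_false_positive"
    return "possible_contamination"

# Illustrative lineages (simplified; real lineages would come from a taxonomy database).
E_COLI = ["Bacteria", "Pseudomonadota", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae", "Escherichia",
          "Escherichia coli"]
HUMAN = ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Primates",
         "Hominidae", "Homo", "Homo sapiens"]

if __name__ == "__main__":
    # An E. coli submission matching an E. coli-derived vector segment.
    print(classify_match(E_COLI, E_COLI))
    # A human submission matching an E. coli-derived vector segment.
    print(classify_match(HUMAN, E_COLI))
```

A real screen would also weigh the strength and position of the match; the sketch isolates only the taxonomic comparison.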
Data on the sources of UniVec entries were included in release 10.0 (ftp://ftp.ncbi.nih.gov/pub/UniVec/). The main software is freely available at https://github.com/aaschaffer/vecscreen_plus_taxonomy.
Supplementary data are available at Bioinformatics online.