Exotic and Emerging Avian Viral Disease Research Unit, Southeast Poultry Research Laboratory, Agricultural Research Service, USDA, Athens, GA, USA.
Influenza Other Respir Viruses. 2014 Jul;8(4):499-505. doi: 10.1111/irv.12239. Epub 2014 Feb 7.
There are over 276 000 influenza gene sequences in public databases, with the quality of the sequences determined by the contributor.
As part of a high school class project, influenza sequences with possible errors were identified in the public databases based on the gene being longer than expected, with the hypothesis that these sequences would contain an error. Students contacted sequence submitters, alerting them to the possible sequence issue(s), and requested that the suspect sequence(s) be corrected as appropriate.
Type A influenza viruses were screened, and gene segments longer than the accepted size were identified for further analysis. Attention was placed on sequences with additional nucleotides upstream or downstream of the highly conserved non-coding ends of the viral segments.
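The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the students used: the function names and the `expected_length` parameter are hypothetical, and the termini are the widely used Uni12 motif (5' end, with A or G at position 4) and the reverse complement of Uni13 (3' end), written in positive sense as sequences usually appear in the databases.

```python
import re

# Conserved termini of influenza A segments in positive (mRNA) sense.
# Assumption: Uni12 (position 4 may be A or G) and the reverse complement of Uni13.
FIVE_PRIME = re.compile(r"AGC[AG]AAAGCAGG")
THREE_PRIME = re.compile(r"CCTTGTTTCTACT")


def flanking_extras(seq):
    """Return (upstream, downstream) nucleotides lying outside the conserved ends.

    A side is None when its conserved terminus is not found at all,
    for example in a partial sequence.
    """
    s = seq.upper().replace("U", "T")
    first = FIVE_PRIME.search(s)            # first 5' motif
    lasts = list(THREE_PRIME.finditer(s))   # all 3' motifs; the last one is used
    upstream = s[:first.start()] if first else None
    downstream = s[lasts[-1].end():] if lasts else None
    return upstream, downstream


def is_suspect(seq, expected_length):
    """Flag a sequence that is longer than expected AND has extra flanking bases."""
    upstream, downstream = flanking_extras(seq)
    return len(seq) > expected_length and bool(upstream or downstream)
```

Because the motifs are matched anywhere in the sequence, a chance internal match could mislead this sketch; a flagged sequence would still need manual review, as the abstract itself suggests.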
A total of 1081 sequences were identified that met this criterion. Three types of errors were commonly observed: non-influenza primer sequence was not removed from the sequence; PCR product was cloned and plasmid sequence was included in the sequence; and Taq polymerase added an adenine at the end of the PCR product. Internal insertions of nucleotide sequence were also commonly observed, but in many cases it was unclear whether the sequence was correct or actually contained an error. A total of 215 sequences, or 22.8% of the suspect sequences, were corrected in the public databases in the first year of the student project. Unfortunately, 138 additional sequences with possible errors were added to the databases in the second year. Greater awareness of the need for data integrity in sequences submitted to public databases is required to fully reap the benefits of these large data sets.
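As a rough illustration of how the extra downstream nucleotides might be triaged into the error types listed above, the following sketch applies a simple rule of thumb. The function name and the returned labels are hypothetical, and telling primer sequence apart from plasmid sequence would in practice require comparison against known primer and vector sequences, which is not attempted here.

```python
def classify_downstream(extra):
    """Give a tentative label to nucleotides found after the conserved 3' end.

    Assumption: a lone trailing A is most simply explained by the
    non-templated adenine that Taq polymerase adds to PCR products;
    anything longer is left as unresolved primer or plasmid sequence.
    """
    extra = (extra or "").upper()
    if not extra:
        return "none"
    if extra == "A":
        return "possible Taq-added adenine"
    return "possible primer or plasmid sequence"
```

A label from this function is a prompt for the submitter to check the record, not a verdict that the sequence is wrong.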