Brunak S, Engelbrecht J, Knudsen S
Department of Structural Properties of Materials, Technical University of Denmark, Lyngby.
Nucleic Acids Res. 1990 Aug 25;18(16):4797-801. doi: 10.1093/nar/18.16.4797.
The use of databanks in genetic research assumes reliability of the information they contain. Currently, error detection in the manually or electronically entered data contained in the nucleotide sequence databanks at EMBL, Heidelberg and GenBank at Los Alamos is limited. We have used a subset of sequences from these databanks to train neural networks to recognize pre-mRNA splicing signals in human genes. During the training on 33 human genes from the EMBL databank, seven genes appeared to disturb the learning process. Subsequent investigation revealed discrepancies from the original published papers for three genes. In four genes, we found wrongly assigned splicing frames of introns. We believe this reflects the fact that splicing frames cannot always be unambiguously assigned on the basis of experimental data. Thus incorrect assignments arise both from mere typographical misprints and from erroneous interpretation of experiments. Training on 241 human sequences from GenBank revealed nine new errors. We propose that such errors could be detected by computer algorithms designed to check the consistency of data prior to their incorporation in databanks.
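The kind of pre-incorporation consistency check proposed above can be illustrated with a minimal sketch. This is not the neural network method used in the study. It is a much simpler rule-based check, assuming the well-known convention that most introns begin with GT and end with AG on the coding strand. The function name `check_introns` and the coordinate format (1-based, inclusive start and end positions) are hypothetical choices made for this illustration.

```python
from typing import List, Tuple


def check_introns(sequence: str, introns: List[Tuple[int, int]]) -> List[str]:
    """Flag annotated introns whose ends do not match the GT...AG consensus.

    Hypothetical helper for illustration only. `introns` holds 1-based,
    inclusive (start, end) coordinates on the given DNA sequence.
    """
    seq = sequence.upper()
    warnings = []
    for start, end in introns:
        # An intron needs room for both dinucleotides and must lie in the sequence.
        if not (1 <= start and start + 3 <= end <= len(seq)):
            warnings.append(f"intron {start}..{end}: coordinates out of range")
            continue
        donor = seq[start - 1:start + 1]   # first two bases of the intron
        acceptor = seq[end - 2:end]        # last two bases of the intron
        if donor != "GT" or acceptor != "AG":
            warnings.append(
                f"intron {start}..{end}: found {donor}...{acceptor}, expected GT...AG"
            )
    return warnings


if __name__ == "__main__":
    toy = "AAAGTCCCCAGTTT"  # made-up sequence with one intron at positions 4..11
    print(check_introns(toy, [(4, 11)]))  # correctly annotated intron
    print(check_introns(toy, [(5, 11)]))  # boundary shifted by one base
```

A warning from such a check is only a prompt for manual review against the original publication, since a minority of genuine introns use other boundary dinucleotides such as GC...AG. It also says nothing about the splicing frame, which would require comparing the annotated exons with the reported protein sequence.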