Neural network detects errors in the assignment of mRNA splice sites.

Author information

Brunak S, Engelbrecht J, Knudsen S

Affiliation

Department of Structural Properties of Materials, Technical University of Denmark, Lyngby.

Publication information

Nucleic Acids Res. 1990 Aug 25;18(16):4797-801. doi: 10.1093/nar/18.16.4797.

Abstract

The use of databanks in genetic research assumes the reliability of the information they contain. Currently, error detection in the manually or electronically entered data in the nucleotide sequence databanks at EMBL, Heidelberg, and GenBank, Los Alamos, is limited. We have used a subset of sequences from these databanks to train neural networks to recognize pre-mRNA splicing signals in human genes. During training on 33 human genes from the EMBL databank, seven genes appeared to disturb the learning process. Subsequent investigation revealed discrepancies from the original published papers for three genes. In four genes, we found wrongly assigned splicing frames of introns. We believe this reflects the fact that splicing frames cannot always be unambiguously assigned on the basis of experimental data. Thus incorrect assignments arise both from mere typographical misprints and from erroneous interpretation of experiments. Training on 241 human sequences from GenBank revealed nine new errors. We propose that such errors could be detected by computer algorithms designed to check the consistency of data prior to their incorporation in databanks.
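The idea of spotting annotation errors through examples that "disturb the learning process" can be illustrated with a minimal sketch. This is not the authors' network or data: a single logistic unit stands in for the neural network, the four-base windows (two exon bases followed by the first two intron bases) are synthetic, and the names `one_hot`, `train`, `flag_unlearned`, and `EXAMPLES` are hypothetical. One window is deliberately annotated as a donor site even though the same window appears twice as a non-site, so no set of weights can fit all three labels; after training, the example the model still gets wrong is reported as a suspect.

```python
import math

BASES = "ACGT"

def one_hot(window):
    """Encode a nucleotide window as a flat one-hot vector plus a bias input."""
    vec = []
    for base in window:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    vec.append(1.0)  # constant bias input
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, window):
    """Probability that the window is a donor site under the current weights."""
    return sigmoid(sum(w * x for w, x in zip(weights, one_hot(window))))

def train(examples, epochs=3000, lr=0.5):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    xs = [one_hot(window) for window, _ in examples]
    ys = [float(label) for _, label in examples]
    weights = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(xs, ys):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad[j] += err * xi
        weights = [w - lr * g / len(xs) for w, g in zip(weights, grad)]
    return weights

def flag_unlearned(examples, weights):
    """Return indices of examples whose label the trained model contradicts."""
    return [
        i for i, (window, label) in enumerate(examples)
        if (predict(weights, window) >= 0.5) != bool(label)
    ]

# Synthetic toy data (assumption, not from the paper): label 1 = annotated donor site.
EXAMPLES = [
    ("AGGT", 1), ("CGGT", 1), ("TGGT", 1), ("AAGT", 1), ("CAGT", 1),  # intron starts GT
    ("AGGA", 0), ("CTGA", 0), ("TGAG", 0), ("GACG", 0),               # non-sites
    ("AGGC", 0), ("AGGC", 0),                                         # non-site, seen twice
    ("AGGC", 1),                                                      # deliberate mis-annotation
]

suspects = flag_unlearned(EXAMPLES, train(EXAMPLES))
print(suspects)  # prints [11], the index of the deliberately mis-annotated window
```

In practice the flagged entries would be candidates for manual comparison against the original publication, which is the follow-up step the abstract describes; the sketch only shows how persistent misclassification can single them out.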

