Determination of eukaryotic protein coding regions using neural networks and information theory.

Author information

Farber R, Lapedes A, Sirotkin K

Affiliation

Theoretical Division, Los Alamos National Laboratory, NM 87545.

Publication information

J Mol Biol. 1992 Jul 20;226(2):471-9. doi: 10.1016/0022-2836(92)90961-i.

Abstract

Our previous work applied neural network techniques to the problem of discriminating open reading frame (ORF) sequences taken from introns versus exons. The method counted the codon frequencies in an ORF of a specified length, and then used this codon frequency representation of DNA fragments to train a neural net (essentially a Perceptron with a sigmoidal, or "soft step function", output) to perform this discrimination. After training, the network was applied to a disjoint "predict" set of data to assess accuracy. The resulting accuracy in our previous work was 98.4%, exceeding accuracies reported in the literature at that time for other algorithms. Here, we report even higher accuracies stemming from calculations of mutual information (a correlation measure) of spatially separated codons in exons, and in introns. Significant mutual information exists in exons, but not in introns, between adjacent codons. This suggests that dicodon frequencies of adjacent codons are important for intron/exon discrimination. We report that accuracies obtained using a neural net trained on the frequency of dicodons are significantly higher at smaller fragment lengths than even our original results using codon frequencies, which were already higher than simple statistical methods that also used codon frequencies. We also report accuracies obtained from including codon and dicodon statistics in all six reading frames, i.e. the three frames on each of the original and complement strands. Inclusion of six-frame statistics increases the accuracy still further. We also compare these neural net results to a Bayesian statistical prediction method that assumes independent codon frequencies in each position. The Bayesian scheme performs worse than any of the neural-net-based schemes; however, many methods reported in the literature use it, either explicitly or implicitly. Specifically, Bayesian prediction schemes based on codon frequencies achieve 90.9% accuracy on 90 codon ORFs, while our best neural net scheme reaches 99.4% accuracy on 60 codon ORFs. "Accuracy" is defined as the average of the exon and intron sensitivities. Achievement of sufficiently high accuracies on short fragment lengths can be useful in providing a computational means of finding coding regions in unannotated DNA sequences such as those arising from the mega-base sequencing efforts of the Human Genome Project. We caution that the high accuracies reported here do not represent a complete solution to the problem of identifying exons in "raw" base sequences. The accuracies are considerably lower for exons of small length, although still higher than accuracies reported in the literature for other methods. Short exon lengths are not uncommon. (ABSTRACT TRUNCATED AT 400 WORDS)
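The quantities named in the abstract (codon frequencies, dicodon frequencies of adjacent codons, mutual information, and accuracy as the average of the exon and intron sensitivities) can be illustrated with a minimal Python sketch. This is not the authors' code. All function names are hypothetical, and two choices are assumptions the abstract does not specify: dicodons are counted as overlapping pairs of adjacent in-frame codons, and mutual information is estimated from raw pair counts in bits.

```python
from collections import Counter
from itertools import product
import math

BASES = "ACGT"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]  # 64 codons, fixed order


def split_codons(seq, frame=0):
    """Split a DNA string into complete codons, starting at offset `frame`."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def codon_frequencies(seq, frame=0):
    """Return a 64-element vector of relative codon frequencies."""
    codons = split_codons(seq, frame)
    counts = Counter(codons)
    total = len(codons) or 1  # avoid division by zero on very short input
    return [counts[c] / total for c in CODONS]


def dicodon_frequencies(seq, frame=0):
    """Return a 4096-element vector of adjacent codon pair frequencies.

    Assumption: pairs overlap, i.e. (c1, c2), (c2, c3), ...
    """
    codons = split_codons(seq, frame)
    pairs = list(zip(codons, codons[1:]))
    counts = Counter(pairs)
    total = len(pairs) or 1
    return [counts[(a, b)] / total for a in CODONS for b in CODONS]


def mutual_information(pairs):
    """Estimate mutual information (bits) between members of (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum(
        (c / n) * math.log2(c * n / (left[x] * right[y]))
        for (x, y), c in joint.items()
    )


def average_sensitivity(true_labels, predicted_labels):
    """Accuracy as defined in the abstract: mean of exon and intron sensitivities."""
    sensitivities = []
    for cls in ("exon", "intron"):
        idx = [i for i, t in enumerate(true_labels) if t == cls]
        correct = sum(predicted_labels[i] == cls for i in idx)
        sensitivities.append(correct / len(idx))
    return sum(sensitivities) / 2
```

In a setup like the one described, the 64- or 4096-element vectors would serve as inputs to a classifier with a sigmoidal output. Averaging the two sensitivities keeps the score from being dominated by whichever class has more fragments.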

