Assessment of protein coding measures.

Authors

Fickett J W, Tung C S

Affiliation

Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, NM 87545.

Publication

Nucleic Acids Res. 1992 Dec 25;20(24):6441-50. doi: 10.1093/nar/20.24.6441.

PMID:1480466

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC334555/

Abstract

A number of methods for recognizing protein coding genes in DNA sequence have been published over the last 13 years, and new, more comprehensive algorithms, drawing on the repertoire of existing techniques, continue to be developed. To optimize continued development, it is valuable to systematically review and evaluate published techniques. At the core of most gene recognition algorithms is one or more coding measures--functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of 'typical' exonic DNA. In this paper we review and synthesize the underlying coding measures from published algorithms. A standardized benchmark is described, and each of the measures is evaluated according to this benchmark. Our main conclusion is that a very simple and obvious measure--counting oligomers--is more effective than any of the more sophisticated measures. Different measures contain different information. However there is a great deal of redundancy in the current suite of measures. We show that in future development of gene recognition algorithms, attention can probably be limited to six of the twenty or so measures proposed to date.
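
As an illustration of what a "coding measure" in the sense of the abstract might look like, the following is a minimal sketch of oligomer counting: a function that maps a sequence window to a vector of k-mer counts, plus one possible way to collapse that vector into a single number. The function names, the choice of k = 3, the log-likelihood-ratio scoring, and the frequency tables are all assumptions made for illustration; the abstract does not specify how the authors implemented or scored their oligomer measure.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def oligomer_counts(window: str, k: int = 3) -> dict[str, int]:
    """Return the count of every possible k-mer over ACGT in the window.

    Overlapping k-mers are counted with a stride of 1. k-mers containing
    characters outside ACGT (for example N) are ignored.
    """
    window = window.upper()
    seen = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return {kmer: seen.get(kmer, 0) for kmer in all_kmers}


def log_likelihood_score(
    window: str,
    coding_freq: dict[str, float],
    noncoding_freq: dict[str, float],
    k: int = 3,
    pseudo: float = 1e-6,
) -> float:
    """Collapse the count vector into one number (hypothetical scoring).

    Each observed k-mer contributes count * log(coding / noncoding), where
    the two frequency tables are assumed to come from labeled training
    sequences. The pseudo-count avoids division by zero and log(0).
    """
    counts = oligomer_counts(window, k)
    return sum(
        n * math.log(
            (coding_freq.get(kmer, 0.0) + pseudo)
            / (noncoding_freq.get(kmer, 0.0) + pseudo)
        )
        for kmer, n in counts.items()
        if n
    )


# Usage with toy, made-up frequency tables (not real training data).
counts = oligomer_counts("ATGATG", k=3)
score = log_likelihood_score(
    "ATGATG", coding_freq={"ATG": 0.5}, noncoding_freq={"ATG": 0.25}
)
```

A positive score here means the window's oligomer content looks more like the hypothetical coding table than the noncoding one. A real benchmark would estimate both tables from annotated exons and non-exonic DNA.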

Similar articles

Assessment of protein coding measures.

Nucleic Acids Res. 1992 Dec 25;20(24):6441-50. doi: 10.1093/nar/20.24.6441.

Locating protein coding regions in human DNA using a decision tree algorithm.

J Comput Biol. 1995 Fall;2(3):473-85. doi: 10.1089/cmb.1995.2.473.

A new Fourier transform approach for protein coding measure based on the format of the Z curve.

Bioinformatics. 1998;14(8):685-90. doi: 10.1093/bioinformatics/14.8.685.

A Fourier characteristic of coding sequences: origins and a non-Fourier approximation.

J Comput Biol. 2005 Nov;12(9):1153-65. doi: 10.1089/cmb.2005.12.1153.

Prediction of probable genes by Fourier analysis of genomic sequences.

Comput Appl Biosci. 1997 Jun;13(3):263-70. doi: 10.1093/bioinformatics/13.3.263.

Recognizing shorter coding regions of human genes based on the statistics of stop codons.

Biopolymers. 2002 Mar;63(3):207-16. doi: 10.1002/bip.10054.

Comparison of various algorithms for recognizing short coding sequences of human genes.

Bioinformatics. 2004 Mar 22;20(5):673-81. doi: 10.1093/bioinformatics/btg467. Epub 2004 Feb 5.

[Comparison study on the methods for finding borders between coding and non-coding DNA regions in rice].

Yi Chuan. 2005 Jul;27(4):629-35.

Indications that "codon boundaries" are physico-chemically defined and that protein-folding information is contained in the redundant exon bases.

Theor Biol Med Model. 2006 Aug 7;3:28. doi: 10.1186/1742-4682-3-28.

Representation of DNA sequences in genetic codon context with applications in exon and intron prediction.

J Bioinform Comput Biol. 2015 Apr;13(2):1550004. doi: 10.1142/S0219720015500043. Epub 2014 Dec 10.

Cited by

RNAincoder: a deep learning-based encoder for RNA and RNA-associated interaction.

Nucleic Acids Res. 2023 Jul 5;51(W1):W509-W519. doi: 10.1093/nar/gkad404.

Nanopore-Based Direct RNA Sequencing of the Transcriptome Identifies Novel lncRNAs.

Genes (Basel). 2023 Feb 28;14(3):610. doi: 10.3390/genes14030610.

LncCat: An ORF attention model to identify LncRNA based on ensemble learning strategy and fused sequence information.

Comput Struct Biotechnol J. 2023 Feb 8;21:1433-1447. doi: 10.1016/j.csbj.2023.02.012. eCollection 2023.

LncDC: a machine learning-based tool for long non-coding RNA detection from RNA-Seq data.

Sci Rep. 2022 Nov 9;12(1):19083. doi: 10.1038/s41598-022-22082-7.

Common Features in lncRNA Annotation and Classification: A Survey.

Noncoding RNA. 2021 Dec 13;7(4):77. doi: 10.3390/ncrna7040077.

AI applications in functional genomics.

Comput Struct Biotechnol J. 2021 Oct 11;19:5762-5790. doi: 10.1016/j.csbj.2021.10.009. eCollection 2021.

RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences.

NAR Genom Bioinform. 2020 Jan 13;2(1):lqz024. doi: 10.1093/nargab/lqz024. eCollection 2020 Mar.

LncLocation: Efficient Subcellular Location Prediction of Long Non-Coding RNA-Based Multi-Source Heterogeneous Feature Fusion.

Int J Mol Sci. 2020 Oct 1;21(19):7271. doi: 10.3390/ijms21197271.

Alternative Splicing of the SLCO1B1 Gene: An Exploratory Analysis of Isoform Diversity in Pediatric Liver.

Clin Transl Sci. 2020 May;13(3):509-519. doi: 10.1111/cts.12733. Epub 2020 Jan 9.

CPPred: coding potential prediction based on the global description of RNA sequence.

Nucleic Acids Res. 2019 May 7;47(8):e43. doi: 10.1093/nar/gkz087.

References

Distance, size and shape.

Ann Eugen. 1954 Mar;18(4):337-43. doi: 10.1111/j.1469-1809.1952.tb02527.x.

Recognition of protein coding regions in DNA sequences.

Nucleic Acids Res. 1982 Sep 11;10(17):5303-18. doi: 10.1093/nar/10.17.5303.

Codon preference and its use in identifying protein coding regions in long DNA sequences.

Nucleic Acids Res. 1982 Jan 11;10(1):141-56. doi: 10.1093/nar/10.1.141.

Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification.

Proc Natl Acad Sci U S A. 1981 Mar;78(3):1596-600. doi: 10.1073/pnas.78.3.1596.

The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression.

Nucleic Acids Res. 1984 Jan 11;12(1 Pt 2):539-49. doi: 10.1093/nar/12.1part2.539.

A prevalent persistent global nonrandomness that distinguishes coding and non-coding eucaryotic nuclear DNA sequences.

J Mol Evol. 1983;19(2):122-33. doi: 10.1007/BF02300750.

The coding function of nucleotide sequences can be discerned by statistical analysis.

J Theor Biol. 1981 Feb 7;88(3):409-20. doi: 10.1016/0022-5193(81)90274-5.

Measurements of the effects that coding for a protein has on a DNA sequence and their use for finding genes.

Nucleic Acids Res. 1984 Jan 11;12(1 Pt 2):551-67. doi: 10.1093/nar/12.1part2.551.

The relationship between base composition and codon usage in bacterial genes and its use for the simple and reliable identification of protein-coding sequences.

Gene. 1984 Oct;30(1-3):157-66. doi: 10.1016/0378-1119(84)90116-1.

Delineation of coding areas in DNA sequences through assignment of codon probabilities.

J Biomol Struct Dyn. 1985 Dec;3(3):543-9. doi: 10.1080/07391102.1985.10508442.
