Molecular sequence accuracy and the analysis of protein coding regions.

Authors

States D J, Botstein D

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894.

Publication

Proc Natl Acad Sci U S A. 1991 Jul 1;88(13):5518-22. doi: 10.1073/pnas.88.13.5518.

PMID:2062834

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC51908/

Abstract

Molecular sequences, like all experimental data, have finite error rates. The impact of errors on the information content of molecular sequence data is dependent on the analytic paradigm used to interpret the data. We studied the impact of nucleic acid sequence errors on the ability to align predicted amino acid sequences with the sequences of related proteins. We found that with a simultaneous translation and alignment algorithm, identification of sequence homologies is resilient to the introduction of random errors. Proteins with greater than 30% sequence identity can be reliably recognized even in the presence of 1% frameshifting (insertion or deletion) error rates and 5% base substitution rates. Incorporation of prior knowledge about the location and characteristics of errors improves tolerance to error of amino acid sequence alignments. Similarly, inclusion of prior knowledge of biased codon utilization by yeast (Saccharomyces cerevisiae) allows reliable detection of correct reading frames in yeast sequences even in the presence of 5% substitution and 1% frameshift errors.
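The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a simultaneous translation and alignment could look like, not the authors' implementation. It assumes a Smith-Waterman-style local alignment in which a DNA sequence is translated codon by codon while being aligned to a protein, and in which one or two stray nucleotides may be skipped at a penalty to recover from a frameshift. The function name `frameshift_align_score`, the identity-based scoring, and all penalty values are illustrative assumptions; a real implementation would more likely use an amino acid substitution matrix.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(codon):
    """Translate one codon; unknown codons (e.g. containing N) become 'X'."""
    return CODON_TABLE.get(codon, "X")


def frameshift_align_score(dna, protein, match=2, mismatch=-1, gap=-2, shift=-4):
    """Best local score for aligning translated DNA against a protein.

    All scoring parameters are assumed values chosen for illustration.
    H[i][j] is the best score of a local alignment that ends after
    consuming dna[:i] and protein[:j].
    """
    n, m = len(dna), len(protein)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            candidates = [0]  # local alignment: may restart anywhere
            if j >= 1:
                # amino acid aligned to a gap in the DNA
                candidates.append(H[i][j - 1] + gap)
            if i >= 3:
                # whole codon aligned to a gap in the protein
                candidates.append(H[i - 3][j] + gap)
                if j >= 1:
                    # codon translated and compared with the amino acid
                    same = translate(dna[i - 3:i]) == protein[j - 1]
                    candidates.append(H[i - 3][j - 1] + (match if same else mismatch))
            for k in (1, 2):
                # frameshift: skip k stray nucleotides and change frame
                if i >= k:
                    candidates.append(H[i - k][j] + shift)
            H[i][j] = max(candidates)
            best = max(best, H[i][j])
    return best


# Hypothetical example: a ten-residue protein and a DNA sequence encoding it.
protein = "MKTAYIAKQR"
clean_dna = "ATGAAAACTGCTTATATTGCTAAACAACGT"
# The same DNA with one inserted base, i.e. a simulated frameshift error.
shifted_dna = clean_dna[:12] + "G" + clean_dna[12:]

tolerant = frameshift_align_score(shifted_dna, protein)
strict = frameshift_align_score(shifted_dna, protein, shift=-1000)
print(tolerant, strict)
```

Setting `shift` to a very large negative value effectively disables frame changes, which gives a rough way to compare an error-tolerant alignment with a fixed-frame one on the same input.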
