Classifying coding DNA with nucleotide statistics.

Authors

Carels Nicolas, Frías Diego

Affiliation

Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil.

Publication

Bioinform Biol Insights. 2009 Oct 28;3:141-54. doi: 10.4137/bbi.s3030.

DOI:10.4137/bbi.s3030
PMID:20140062
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2808172/
Abstract

In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method outputs coding sequences on their coding strand and in their coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
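
To make the quantities named in items (i) to (iv) concrete, the following is a minimal Python sketch that computes them for one reading frame of a DNA string. It is not the authors' implementation: the function names (`codons`, `position_probs`, `frame_features`), the use of a raw stop-codon count as a stand-in for the "stop codon distribution", and the dictionary layout are all assumptions made for illustration. The abstract does not say how the features are combined into a score, and item (v) needs the regression line of the universal GC3 vs. GC2 correlation, which is not given here, so the sketch only reports GC2 and GC3.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
PURINES = "AG"


def codons(seq, frame=0):
    """Split seq into complete triplets starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def position_probs(triplets):
    """Return three dicts giving P(base) at triplet positions 1, 2 and 3."""
    n = len(triplets)
    probs = []
    for pos in range(3):
        counts = Counter(t[pos] for t in triplets)
        probs.append({b: counts[b] / n for b in "ACGT"})
    return probs


def frame_features(seq, frame=0):
    """Illustrative per-frame features; names and layout are hypothetical."""
    triplets = codons(seq, frame)
    if not triplets:
        raise ValueError("sequence too short for this frame")
    p1, p2, p3 = position_probs(triplets)
    purine = [sum(p[b] for b in PURINES) for p in (p1, p2, p3)]
    return {
        # (i) simplified to a count of in-frame stop codons
        "stop_count": sum(t in STOP_CODONS for t in triplets),
        # (ii) product of purine probabilities over the three positions
        "purine_product": purine[0] * purine[1] * purine[2],
        # (iii) P(C at 1st) * P(G at 2nd) * P(A at 3rd), as listed in the abstract
        "cga_product": p1["C"] * p2["G"] * p3["A"],
        # (iv) probabilities of G in the 1st and 2nd positions
        "g1": p1["G"],
        "g2": p2["G"],
        # inputs to (v); the regression line itself is not reproduced here
        "gc2": p2["G"] + p2["C"],
        "gc3": p3["G"] + p3["C"],
    }


if __name__ == "__main__":
    example = "ATGGCAGAATAA"  # toy sequence, not real data
    for frame in range(3):
        print(frame, frame_features(example, frame))
```

Running the same function on the reverse complement would cover the three frames of the opposite strand; how the six frames are then ranked is left open, since that step is not described in the abstract.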

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d841/2808172/6d5b9af3f161/bbi-2009-141f1.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d841/2808172/8f4cbd9701f3/bbi-2009-141f2.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d841/2808172/d13fab5a48ad/bbi-2009-141f3.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d841/2808172/097580dd4a0b/bbi-2009-141f4.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d841/2808172/5fa6bb731cec/bbi-2009-141f5.jpg
Figure 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d841/2808172/d790ac353385/bbi-2009-141f6.jpg
