Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes.

Authors

McNair Katelyn, Ecale Zhou Carol L, Souza Brian, Malfatti Stephanie, Edwards Robert A

Affiliations

Computational Sciences Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182, USA.

Lawrence Livermore National Laboratory, Global Security Computing Applications, Livermore, CA 94550, USA.

Publication information

Microorganisms. 2021 Jan 8;9(1):129. doi: 10.3390/microorganisms9010129.

DOI:10.3390/microorganisms9010129
PMID:33429904
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7827183/
Abstract

One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode a protein, and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach is using shared homology with a database of known genes. This method presents many pitfalls, most notably the catch that you only find genes that you have seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing for the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFs). Our workflow takes the amino acid frequencies of each ORF, calculates an entropy density profile (EDP), clusters the EDPs with KMeans, and then selects the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes, and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
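
The workflow described in the abstract (amino acid frequencies, then an EDP, then clustering, then picking the cluster with the lowest variation) can be sketched roughly as follows. This is an illustrative sketch and not the GOODORFS implementation. It assumes the EDP definition commonly given in the MED literature, s_i = -(1/H) p_i log p_i with H the Shannon entropy of the amino acid frequencies. It swaps in a small hand-written Lloyd's k-means with deterministic initialization in place of a library KMeans, and it assumes "lowest variation" means the smallest mean squared distance to the cluster centroid. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def entropy_density_profile(protein):
    """Return a 20-dimensional EDP for a translated ORF (assumed MED-style definition)."""
    counts = Counter(aa for aa in protein if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    freqs = [counts[aa] / total for aa in AMINO_ACIDS]
    # Per-residue entropy terms -p*log(p); a frequency of zero contributes zero.
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one residue type is present, so the profile is undefined; return zeros.
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids (illustration only)."""
    if k > len(points):
        raise ValueError("k must not exceed the number of points")
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def lowest_variation_cluster(points, labels, centroids):
    """Index of the cluster with the smallest mean squared distance to its centroid."""
    def spread(c):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            return math.inf
        return sum(sq_dist(p, centroids[c]) for p in members) / len(members)
    return min(range(len(centroids)), key=spread)


if __name__ == "__main__":
    # Toy translated ORFs, made up for this example.
    orfs = ["MKKLLAAGGEEDD", "MKRLLAAGGEEDE", "MSSSSSSPPPPPP", "MWWWWCCCCHHHH"]
    edps = [entropy_density_profile(s) for s in orfs]
    labels, centroids = kmeans(edps, k=2)
    chosen = lowest_variation_cluster(edps, labels, centroids)
    print("cluster treated as coding:", chosen)
    print("ORFs in that cluster:", [s for s, lab in zip(orfs, labels) if lab == chosen])
```

Seeding the centroids from the first k points keeps the sketch reproducible. A production pipeline would more likely rely on a library clustering routine with a better initialization strategy.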

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c77/7827183/a8d97977e545/microorganisms-09-00129-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c77/7827183/6a1610891c19/microorganisms-09-00129-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c77/7827183/ecf715bd14ac/microorganisms-09-00129-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c77/7827183/59cf37d136f1/microorganisms-09-00129-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5c77/7827183/36ed9633d898/microorganisms-09-00129-g005.jpg

Similar articles

1
Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes.
Microorganisms. 2021 Jan 8;9(1):129. doi: 10.3390/microorganisms9010129.
2
Multivariate entropy distance method for prokaryotic gene identification.
J Bioinform Comput Biol. 2004 Jun;2(2):353-73. doi: 10.1142/s0219720004000624.
3
PHANOTATE: a novel approach to gene identification in phage genomes.
Bioinformatics. 2019 Nov 1;35(22):4537-4542. doi: 10.1093/bioinformatics/btz265.
4
[Comprehensive re-annotation of protein-coding genes for prokaryotic genomes by Z-curve and similarity-based methods].
Yi Chuan. 2020 Jul 20;42(7):691-702. doi: 10.16288/j.yczz.20-022.
5
GeneLook: a novel ab initio gene identification system suitable for automated annotation of prokaryotic sequences.
Gene. 2005 Feb 14;346:115-25. doi: 10.1016/j.gene.2004.10.018. Epub 2005 Jan 26.
6
[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].
Yi Chuan Xue Bao. 2004 May;31(5):431-43.
7
Finding prokaryotic genes by the 'frame-by-frame' algorithm: targeting gene starts and overlapping genes.
Bioinformatics. 1999 Nov;15(11):874-86. doi: 10.1093/bioinformatics/15.11.874.
8
Missing genes in the annotation of prokaryotic genomes.
BMC Bioinformatics. 2010 Mar 15;11:131. doi: 10.1186/1471-2105-11-131.
9
10
ProsmORF-pred: a machine learning-based method for the identification of small ORFs in prokaryotic genomes.
Brief Bioinform. 2023 May 19;24(3). doi: 10.1093/bib/bbad101.

Cited by

1
Analysis of RNA translation with a deep learning architecture provides new insight into translation control.
Nucleic Acids Res. 2025 Apr 10;53(7). doi: 10.1093/nar/gkaf277.
2
Multiomic Analysis of Environmental Effects and Nitrogen Use Efficiency of Two Potato Varieties Under High Nitrogen Conditions.
Plants (Basel). 2025 Feb 20;14(5):633. doi: 10.3390/plants14050633.
3
Analysis of RNA translation with a deep learning architecture provides new insight into translation control.
bioRxiv. 2024 Jul 2:2023.07.08.548206. doi: 10.1101/2023.07.08.548206.
4
Special Issue "Bacteriophage Genomics": Editorial.
Microorganisms. 2023 Mar 8;11(3):693. doi: 10.3390/microorganisms11030693.
5
MultiPhATE2: code for functional annotation and comparison of phage genomes.
G3 (Bethesda). 2021 May 7;11(5). doi: 10.1093/g3journal/jkab074.

References

1
Array programming with NumPy.
Nature. 2020 Sep;585(7825):357-362. doi: 10.1038/s41586-020-2649-2. Epub 2020 Sep 16.
2
PHANOTATE: a novel approach to gene identification in phage genomes.
Bioinformatics. 2019 Nov 1;35(22):4537-4542. doi: 10.1093/bioinformatics/btz265.
3
Analyses of four new Caulobacter Phicbkviruses indicate independent lineages.
J Gen Virol. 2019 Feb;100(2):321-331. doi: 10.1099/jgv.0.001218. Epub 2019 Jan 18.
4
Prodigal: prokaryotic gene recognition and translation initiation site identification.
BMC Bioinformatics. 2010 Mar 8;11:119. doi: 10.1186/1471-2105-11-119.
5
MED: a new non-supervised gene prediction algorithm for bacterial and archaeal genomes.
BMC Bioinformatics. 2007 Mar 16;8:97. doi: 10.1186/1471-2105-8-97.
6
CRITICA: coding region identification tool invoking comparative analysis.
Mol Biol Evol. 1999 Apr;16(4):512-24. doi: 10.1093/oxfordjournals.molbev.a026133.
7
Microbial gene identification using interpolated Markov models.
Nucleic Acids Res. 1998 Jan 15;26(2):544-8. doi: 10.1093/nar/26.2.544.
8
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.
Science. 1995 Jul 28;269(5223):496-512. doi: 10.1126/science.7542800.
9
Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene.
Nature. 1976 Apr 8;260(5551):500-7. doi: 10.1038/260500a0.