Global discriminative learning for higher-accuracy computational gene prediction.

Authors

Axel Bernal, Koby Crammer, Artemis Hatzigeorgiou, Fernando Pereira

Affiliations

Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America.

Publication information

PLoS Comput Biol. 2007 Mar 16;3(3):e54. doi: 10.1371/journal.pcbi.0030054. Epub 2007 Feb 2.

DOI: 10.1371/journal.pcbi.0030054

PMID: 17367206

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC1828702/

Abstract

Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.
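To make the two ingredients named in the abstract concrete, here is a minimal sketch of semi-Markov decoding over labelled segments combined with an online large-margin (passive-aggressive style) weight update. It is not CRAIG's implementation: the two-label state set, the toy features, the per-base mismatch loss, and every name below (`segment_features`, `decode`, `pa_update`, `TOY_SEQ`, `TOY_GOLD`) are assumptions made for illustration, and the update is a simplified single-constraint step rather than the algorithm used in the paper.

```python
from collections import defaultdict

LABELS = ("exon", "intron")  # hypothetical two-label toy; real gene models use more states

def segment_features(seq, start, end, label, prev_label):
    """Toy features of one labelled segment seq[start:end] (illustrative only)."""
    seg = seq[start:end]
    feats = defaultdict(float)
    feats[("gc", label)] += sum(c in "GC" for c in seg)  # content feature
    feats[("at", label)] += sum(c in "AT" for c in seg)  # content feature
    feats[("len", label, end - start)] += 1.0            # segment-length feature
    feats[("trans", prev_label, label)] += 1.0           # label-transition feature
    return feats

def score(w, feats):
    """Linear score: dot product of the weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def decode(seq, w, max_len=None):
    """Semi-Markov Viterbi: highest-scoring segmentation of a non-empty seq."""
    n = len(seq)
    max_len = max_len or n
    best = {(0, None): (0.0, None)}  # (end, label) -> (score, previous state)
    for end in range(1, n + 1):
        for label in LABELS:
            cands = []
            for start in range(max(0, end - max_len), end):
                for prev in ((None,) if start == 0 else LABELS):
                    # Adjacent segments must carry different labels.
                    if prev == label or (start, prev) not in best:
                        continue
                    feats = segment_features(seq, start, end, label, prev)
                    cands.append((best[(start, prev)][0] + score(w, feats), (start, prev)))
            if cands:
                best[(end, label)] = max(cands, key=lambda c: c[0])
    state = max(((n, l) for l in LABELS if (n, l) in best), key=lambda s: best[s][0])
    segs = []
    while state[0] > 0:  # follow back-pointers to recover the segments
        prev_state = best[state][1]
        segs.append((prev_state[0], state[0], state[1]))
        state = prev_state
    return segs[::-1]

def global_features(seq, segs):
    """Sum of segment features over a whole segmentation."""
    total = defaultdict(float)
    prev = None
    for start, end, label in segs:
        for k, v in segment_features(seq, start, end, label, prev).items():
            total[k] += v
        prev = label
    return total

def labels_per_base(segs):
    return [label for start, end, label in segs for _ in range(end - start)]

def pa_update(w, seq, gold):
    """One large-margin step: after it, gold outscores the prediction by the loss."""
    pred = decode(seq, w)
    loss = sum(a != b for a, b in zip(labels_per_base(gold), labels_per_base(pred)))
    if loss == 0:
        return 0
    fg, fp = global_features(seq, gold), global_features(seq, pred)
    diff = {k: fg.get(k, 0.0) - fp.get(k, 0.0) for k in set(fg) | set(fp)}
    norm2 = sum(v * v for v in diff.values())
    if norm2 > 0:
        tau = max(0.0, (loss - score(w, diff)) / norm2)  # smallest sufficient step
        for k, v in diff.items():
            w[k] = w.get(k, 0.0) + tau * v
    return loss

# Toy data (made up): a GC-rich "exon", an AT-rich "intron", another "exon".
TOY_SEQ = "GCGCATATGCGC"
TOY_GOLD = [(0, 4, "exon"), (4, 8, "intron"), (8, 12, "exon")]

weights = {}
for epoch in range(10):
    if pa_update(weights, TOY_SEQ, TOY_GOLD) == 0:
        break
print(decode(TOY_SEQ, weights))
```

The point of the sketch is the contrast drawn in the abstract: all feature weights are adjusted jointly against the error of the full predicted segmentation, instead of being estimated as separately trained signal and content models.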

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c156/1847997/059070be5ee6/pcbi.0030054.g001.jpg

Similar articles

Identification of coding and non-coding sequences using local Hölder exponent formalism.

Bioinformatics. 2005 Oct 15;21(20):3818-23. doi: 10.1093/bioinformatics/bti639. Epub 2005 Aug 23.

Gene function prediction based on genomic context clustering and discriminative learning: an application to bacteriophages.

BMC Bioinformatics. 2007 May 22;8 Suppl 4(Suppl 4):S6. doi: 10.1186/1471-2105-8-S4-S6.

Mismatch string kernels for discriminative protein classification.

Bioinformatics. 2004 Mar 1;20(4):467-76. doi: 10.1093/bioinformatics/btg431. Epub 2004 Jan 22.

[Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes].

Yi Chuan Xue Bao. 2004 May;31(5):431-43.

SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection.

Bioinformatics. 2008 Mar 15;24(6):783-90. doi: 10.1093/bioinformatics/btn028. Epub 2008 Feb 1.

Discriminative learning for dynamic state prediction.

IEEE Trans Pattern Anal Mach Intell. 2009 Oct;31(10):1847-61. doi: 10.1109/TPAMI.2009.37.

POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions.

Bioinformatics. 2007 Aug 15;23(16):2046-53. doi: 10.1093/bioinformatics/btm302. Epub 2007 Jun 1.

HMM-ModE--improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences.

BMC Bioinformatics. 2007 Mar 27;8:104. doi: 10.1186/1471-2105-8-104.

Fast model-based protein homology detection without alignment.

Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.

Cited by

The nuclear and mitochondrial genomes of Frieseomelitta varia - a highly eusocial stingless bee (Meliponini) with a permanently sterile worker caste.

BMC Genomics. 2020 Jun 3;21(1):386. doi: 10.1186/s12864-020-06784-8.

Affiliated Fusion Conditional Random Field for Urban UAV Image Semantic Segmentation.

Sensors (Basel). 2020 Feb 12;20(4):993. doi: 10.3390/s20040993.

A Study of Domain Adaptation Classifiers Derived From Logistic Regression for the Task of Splice Site Prediction.

IEEE Trans Nanobioscience. 2016 Mar;15(2):75-83. doi: 10.1109/TNB.2016.2522400. Epub 2016 Jan 28.

Chromerid genomes reveal the evolutionary path from photosynthetic algae to obligate intracellular parasites.

Elife. 2015 Jul 15;4:e06974. doi: 10.7554/eLife.06974.

Integration of RNA-seq and proteomics data with genomics for improved genome annotation in Apicomplexan parasites.

Proteomics. 2015 Aug;15(15):2557-9. doi: 10.1002/pmic.201500253.

A large-scale proteogenomics study of apicomplexan pathogens-Toxoplasma gondii and Neospora caninum.

Proteomics. 2015 Aug;15(15):2618-28. doi: 10.1002/pmic.201400553. Epub 2015 May 15.

Novel Gene Discovery in the Human Malaria Parasite using Nucleosome Positioning Data.

Comput Syst Bioinformatics Conf. 2010 Aug;9:124-135.

Effective automated feature construction and selection for classification of biological sequences.

PLoS One. 2014 Jul 17;9(7):e99982. doi: 10.1371/journal.pone.0099982. eCollection 2014.

Reassessing domain architecture evolution of metazoan proteins: the contribution of different evolutionary mechanisms.

Genes (Basel). 2011 Aug 5;2(3):578-98. doi: 10.3390/genes2030578.

Reassessing domain architecture evolution of metazoan proteins: major impact of gene prediction errors.

Genes (Basel). 2011 Jul 13;2(3):449-501. doi: 10.3390/genes2030449.
