Learning Parsimonious Ensembles for Unbalanced Computational Genomics Problems.

Authors

Stanescu Ana, Pandey Gaurav

Affiliations

Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.

Publication information

Pac Symp Biocomput. 2017;22:288-299. doi: 10.1142/9789813207813_0028.

Abstract

Prediction problems in biomedical sciences are generally quite difficult, partially due to incomplete knowledge of how the phenomenon of interest is influenced by the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor(s) for specific problems. In these situations, a powerful approach to improving prediction performance is to construct ensembles that combine the outputs of many individual base predictors, which have been successful for many biomedical prediction tasks. Moreover, selecting a parsimonious ensemble can be of even greater value for biomedical sciences, where it is not only important to learn an accurate predictor, but also to interpret what novel knowledge it can provide about the target problem. Ensemble selection is a promising approach for this task because of its ability to select a collectively predictive subset, often a relatively small one, of all input base predictors. One of the most well-known algorithms for ensemble selection, CES (Caruana et al.'s Ensemble Selection), generally performs well in practice, but faces several challenges due to the difficulty of choosing the right values of its various parameters. Since the choices made for these parameters are usually ad-hoc, good performance of CES is difficult to guarantee for a variety of problems or datasets. To address these challenges with CES and other such algorithms, we propose a novel heterogeneous ensemble selection approach based on the paradigm of reinforcement learning (RL), which offers a more systematic and mathematically sound methodology for exploring the many possible combinations of base predictors that can be selected into an ensemble. We develop three RL-based strategies for constructing ensembles and analyze their results on two unbalanced computational genomics problems, namely the prediction of protein function and splice sites in eukaryotic genomes. 
We show that the resultant ensembles are indeed substantially more parsimonious than the full set of base predictors, yet still offer almost the same classification power, especially for larger datasets. The RL ensembles also yield a better combination of parsimony and predictive performance than CES.
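
To make the idea of ensemble selection concrete, the following is a minimal sketch of greedy forward selection with replacement, the general scheme that CES-style algorithms build on. It is not the authors' implementation, and it does not cover the RL-based strategies. The function names, the F1-at-0.5 metric, the stopping rule, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def f1_at_threshold(y_true, scores, threshold=0.5):
    """F1 of the positive class after thresholding scores (assumed metric)."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def greedy_ensemble_selection(base_scores, y_val, max_rounds=20):
    """Greedy forward selection with replacement (hypothetical helper).

    base_scores: array of shape (n_predictors, n_samples) holding each base
    predictor's validation scores in [0, 1].
    Returns the selected predictor indices (repeats allowed) and the
    validation F1 of their averaged scores.
    """
    n_predictors, n_samples = base_scores.shape
    selected = []
    running_sum = np.zeros(n_samples)
    best_so_far = -np.inf
    for _ in range(max_rounds):
        # Score the ensemble that would result from adding each candidate.
        candidates = [
            f1_at_threshold(y_val, (running_sum + base_scores[j]) / (len(selected) + 1))
            for j in range(n_predictors)
        ]
        j_best = int(np.argmax(candidates))
        # Stop as soon as no candidate strictly improves the ensemble.
        if candidates[j_best] <= best_so_far:
            break
        best_so_far = candidates[j_best]
        selected.append(j_best)
        running_sum += base_scores[j_best]
    return selected, best_so_far


# Synthetic unbalanced validation set (about 10% positives) with five base
# predictors of decreasing quality; purely illustrative data.
rng = np.random.default_rng(0)
y_val = (rng.random(500) < 0.1).astype(int)
noise_levels = [0.2, 0.3, 0.4, 0.5, 0.6]
base_scores = np.stack([
    np.clip(0.3 + 0.4 * y_val + rng.normal(0, s, size=y_val.size), 0, 1)
    for s in noise_levels
])

selected, score = greedy_ensemble_selection(base_scores, y_val)
print("selected predictors:", selected)
print("distinct predictors used:", len(set(selected)), "of", base_scores.shape[0])
print("validation F1:", round(score, 3))
```

The number of distinct indices in `selected` is one simple way to read off parsimony. The dependence of the result on choices such as `max_rounds`, the metric, and the stopping rule illustrates the kind of parameter sensitivity the abstract attributes to CES.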
