Using rule-based machine learning for candidate disease gene prioritization and sample classification of cancer gene expression data.

Affiliation

Interdisciplinary Computing and Complex Systems Research Group, University of Nottingham, Nottingham, United Kingdom.

Publication information

PLoS One. 2012;7(7):e39932. doi: 10.1371/journal.pone.0039932. Epub 2012 Jul 11.

Abstract

Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
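
The abstract compares gene rankings by the pointwise mutual information (PMI) between disease terms and gene names in the literature. As a rough illustration only, the standard PMI definition, log2 of p(x, y) / (p(x) p(y)), can be estimated from document counts as sketched below. The abstract does not say how the probabilities were estimated, so the count-based estimate, the function name `pmi`, and all numbers here are assumptions made for illustration, not the authors' procedure or data.

```python
import math


def pmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Pointwise mutual information in bits, estimated from document counts.

    n_xy    : documents mentioning both the gene name and the disease term
    n_x     : documents mentioning the gene name
    n_y     : documents mentioning the disease term
    n_total : documents in the corpus
    (All four counts are hypothetical inputs for this sketch.)
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if min(n_xy, n_x, n_y) <= 0:
        # No co-occurrence (or an unseen term): PMI tends to minus infinity.
        return float("-inf")
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


# Made-up counts: the gene and disease term co-occur 25 times more often
# than expected if they were mentioned independently.
score = pmi(n_xy=50, n_x=200, n_y=1000, n_total=100_000)
print(f"PMI = {score:.3f} bits")
```

A positive score means the gene name and the disease term co-occur more often than independence would predict; averaging such scores over the top-ranked genes is one plausible way to compare two rankings.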

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9524/3394775/c7170f769a5b/pone.0039932.g001.jpg
