Leveraging domain information to restructure biological prediction.

Affiliation

Department of Computer and Information Science, University of Mississippi, USA.

Publication information

BMC Bioinformatics. 2011 Oct 18;12 Suppl 10(Suppl 10):S22. doi: 10.1186/1471-2105-12-S10-S22.

Abstract

BACKGROUND

It is commonly believed that including domain knowledge in a prediction model is desirable. However, representing and incorporating domain information in the learning process is, in general, a challenging problem. In this research, we consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The goal of this research is to develop an algorithm to identify discrete or categorical attributes that maximally simplify the learning task.
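To make the idea of a partition concrete, the following minimal sketch splits a toy data set into non-overlapping sub-problems, one per value of a categorical attribute. The record layout, the attribute name `subtype`, and the helper `partition_by_attribute` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def partition_by_attribute(records, attribute):
    """Split records into non-overlapping sub-problems, one per attribute value."""
    subproblems = defaultdict(list)
    for record in records:
        subproblems[record[attribute]].append(record)
    return dict(subproblems)


# Toy records; "subtype" is a hypothetical categorical attribute.
records = [
    {"subtype": "A", "feature": -1.0, "label": 0},
    {"subtype": "A", "feature": 1.0, "label": 1},
    {"subtype": "B", "feature": -1.0, "label": 1},
    {"subtype": "B", "feature": 1.0, "label": 0},
]

subproblems = partition_by_attribute(records, "subtype")
for value, subset in subproblems.items():
    print(value, [(r["feature"], r["label"]) for r in subset])
```

In this toy case no single threshold on `feature` separates the labels over the whole data set, but within each `subtype` a single threshold does, which is the sense in which a partition can simplify the learning task.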

RESULTS

We consider restructuring a supervised learning problem by partitioning the problem space with a discrete or categorical attribute. A naive approach exhaustively searches all possible restructured problems, which is computationally prohibitive when the number of discrete or categorical attributes is large. Instead, we propose a metric that ranks attributes according to their potential to reduce the uncertainty of a classification task. The metric is quantified as the conditional entropy achieved by a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach is tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem, i.e., the classifier built for the restructured problem always outperforms the classifier built for the original problem.
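The ranking metric can be illustrated with a rough sketch. It is not the authors' implementation: it assumes that the classifier for each sub-problem is a single threshold on a one-dimensional random projection, and that an attribute's score is the size-weighted conditional entropy of the labels, averaged over projections. The function names, the Gaussian projection directions, and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def min_split_entropy(values, labels):
    """Smallest label entropy reachable with at most one threshold on values."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = entropy(labels)  # baseline: no threshold at all
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a threshold between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        h = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        best = min(best, h)
    return best


def attribute_score(X, y, attr, n_projections=50, seed=0):
    """Average, over random projections, of the size-weighted minimum
    conditional entropy within each partition induced by attr.
    Lower scores suggest the attribute simplifies the problem more."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, value in enumerate(attr):
        groups[value].append(i)
    total = 0.0
    for _ in range(n_projections):
        w = [rng.gauss(0.0, 1.0) for _ in range(len(X[0]))]
        for idx in groups.values():
            proj = [sum(wj * xj for wj, xj in zip(w, X[i])) for i in idx]
            labs = [y[i] for i in idx]
            total += (len(idx) / len(y)) * min_split_entropy(proj, labs)
    return total / n_projections


# Toy data: the label depends on the sign of x, flipped between two groups.
X = [[-2.0], [-1.0], [1.0], [2.0], [-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1, 1, 1, 0, 0]
good_attr = ["a", "a", "a", "a", "b", "b", "b", "b"]  # aligned with the flip
bad_attr = ["a", "b", "a", "b", "a", "b", "a", "b"]   # unrelated to the flip

good_score = attribute_score(X, y, good_attr)
bad_score = attribute_score(X, y, bad_attr)
print("good attribute:", good_score)
print("bad attribute:", bad_score)
```

Under these assumptions, candidate attributes would be ranked by ascending score, so `good_attr` is preferred over `bad_attr` on the toy data without training a full classifier for every candidate partition.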

CONCLUSIONS

The proposed conditional-entropy-based metric is effective in identifying good partitions of a classification problem and thereby enhances prediction performance.
