Gamberger Dragan, Lavrac Nada, Zelezný Filip, Tolar Jakub
Laboratory for Information Systems, Rudjer Bosković Institute, Zagreb, Croatia.
J Biomed Inform. 2004 Aug;37(4):269-84. doi: 10.1016/j.jbi.2004.07.007.
Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by a high risk of overfitting the data due to the abundance of attributes (simultaneously measured gene expression values) and the shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-the-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insights they can provide. The goal of this study is to apply to this domain a methodology for constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, the paper shows the feasibility of this approach, employing a recently developed subgroup discovery methodology. Some of the discovered classifiers allow for novel biological interpretations.