Mixture classification model based on clinical markers for breast cancer prognosis.
Author information
School of Computer, Wuhan University, Wuhan 430079, China.
Publication information
Artif Intell Med. 2010 Feb-Mar;48(2-3):129-37. doi: 10.1016/j.artmed.2009.07.008. Epub 2009 Dec 14.
OBJECTIVE
Accurate prediction of cancer prognosis is critical to cancer treatment. Many prognosis models based on clinical markers exist, but few are satisfactory in clinical applications. With the development of microarray technologies, cancer researchers have discovered many genes that serve as new markers in gene expression data and have developed powerful prognosis models based on these so-called genetic biomarkers. However, the application of such biomarkers still suffers from several problems. First, gene expression data contain a great number of genes but only a few samples, so it is difficult to select a unified gene set on which to build a stable prognosis classifier. Second, for experimental and technical reasons, gene expression data contain noise and redundancy, which may lead to a prognosis predictor with poor performance. Last but not least, microarray experiments are currently so expensive that it is hard to obtain abundant samples. It is therefore practical, in real cancer treatment applications, to develop prognosis methods based mainly on conventional clinical markers. This paper aims to establish an accurate classification model for cancer prognosis that makes full use of the valuable information in clinical data, especially the information that most existing methods ignore when they aim for high prediction accuracy.
METHODS
First, this paper gives a formal description of the general classification problem and presents a novel mixture classification model that makes full use of the valuable information in clinical data. The model is similar to traditional ensemble classification models, except that it places strict constraints on the construction of the mapping functions so as to avoid a voting process. Then, a two-layer instance of the proposed model, named MRS (Mixture of Rough set and Support vector machine), is constructed by integrating rough set and support vector machine (SVM) classification methods: the rough set classifier acts as the first layer and identifies some singular samples in the data, and the SVM classifier acts as the second layer and classifies the remaining samples. Finally, MRS is used for prognosis prediction on two open breast cancer datasets. The first, denoted BRC-1 hereafter, is a high-quality, publicly available dataset of 97 breast cancer tumors from node-negative patients. The second, denoted BRC-2 hereafter, uses baseline human primary breast tumor data from the LBL breast cancer cell collection and contains 174 samples.
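The two-layer idea can be illustrated with a minimal sketch. This is not the authors' MRS implementation: the first layer here is a simplified stand-in for a rough set classifier (it keeps only attribute patterns whose training samples all share one label), and the second layer is a nearest-centroid classifier standing in for the SVM so that the example needs no external libraries. The names `learn_certain_rules`, `fit_nearest_centroid`, and `TwoLayerCascade`, the integer encoding of clinical markers, and the choice to train the second layer on all training samples are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Set, Tuple

Sample = Tuple[int, ...]  # hypothetical encoding: discretized clinical markers
Predictor = Callable[[Sample], int]


def learn_certain_rules(X: Sequence[Sample], y: Sequence[int]) -> Dict[Sample, int]:
    """Keep attribute patterns whose training samples all carry the same label.

    Simplified stand-in for rough-set "certain" rules; not the paper's method.
    """
    labels_by_pattern: Dict[Sample, Set[int]] = defaultdict(set)
    for xi, yi in zip(X, y):
        labels_by_pattern[xi].add(yi)
    return {p: next(iter(ls)) for p, ls in labels_by_pattern.items() if len(ls) == 1}


def fit_nearest_centroid(X: Sequence[Sample], y: Sequence[int]) -> Predictor:
    """Stand-in for the SVM layer: assign the class with the closest mean vector."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

    def predict(x: Sample) -> int:
        def sq_dist(c: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, centroids[c]))
        return min(centroids, key=sq_dist)

    return predict


class TwoLayerCascade:
    """Layer 1 answers when a rule matches; otherwise layer 2 decides. No voting."""

    def __init__(self, fit_second_layer: Callable[[Sequence[Sample], Sequence[int]], Predictor]):
        self.fit_second_layer = fit_second_layer
        self.rules: Dict[Sample, int] = {}
        self.second: Predictor = lambda x: 0  # replaced in fit()

    def fit(self, X: Sequence[Sample], y: Sequence[int]) -> "TwoLayerCascade":
        self.rules = learn_certain_rules(X, y)
        # Assumption: the second layer is trained on all training samples.
        self.second = self.fit_second_layer(X, y)
        return self

    def predict_one(self, x: Sample) -> Tuple[int, str]:
        """Return (label, name of the layer that produced it)."""
        if x in self.rules:
            return self.rules[x], "layer1"
        return self.second(x), "layer2"
```

Because each sample is answered by exactly one layer, the two classifiers never have to be reconciled by a vote. Any classifier with the same `fit`-then-predict shape, such as an SVM from a machine learning library, could be passed in place of `fit_nearest_centroid`.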
RESULTS
We performed two experiments, one on BRC-1 and one on BRC-2. In the first experiment, the BRC-1 dataset is divided into a training set of 78 patients (34 in the poor prognosis group and 44 in the good prognosis group) and a test set of 19 patients (12 in the poor prognosis group and 7 in the good prognosis group). After training on the training set, MRS correctly classifies all 12 patients with poor prognosis and 6 of the 7 patients with good prognosis in the test set. These results are better than those of previous studies, and even better than those of the 70-gene biomarkers. In the second experiment, we construct the classifiers on the BRC-2 dataset and compare MRS with other representative methods in the Weka software using 5-fold cross-validation; the comparison shows that MRS has higher prediction accuracy than those methods.
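The BRC-1 test-set counts can be turned into standard summary metrics with a few lines of arithmetic. The sketch below assumes that the poor prognosis group is treated as the positive class, which the abstract does not state; the function name `confusion_metrics` is illustrative. The percentages are derived here from the reported counts and are not quoted from the paper.

```python
from typing import Dict


def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of positive cases found
        "specificity": tn / (tn + fp),  # share of negative cases found
    }


# Assumption: "poor prognosis" is the positive class.
# Reported counts: 12 of 12 poor-prognosis and 6 of 7 good-prognosis patients correct.
brc1_test = confusion_metrics(tp=12, fn=0, tn=6, fp=1)
for name, value in brc1_test.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption the counts correspond to 18 of 19 patients classified correctly (about 94.7% accuracy), 100% sensitivity, and 6 of 7 (about 85.7%) specificity.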
CONCLUSIONS
The proposed mixture classification model can easily integrate methods with different characteristics. It overcomes the shortcomings of traditional voting-based ensemble models and can therefore make full use of the information in clinical data. The experimental results show that our MRS implementation predicts breast cancer prognosis more accurately than previous prognostic methods.