Wang Dong, Lv Yingli, Guo Zheng, Li Xia, Li Yanhui, Zhu Jing, Yang Da, Xu Jianzhen, Wang Chenguang, Rao Shaoqi, Yang Baofeng
Department of Bioinformatics and Bio-pharmaceutical Key Laboratory of Heilongjiang Province and State, Harbin Medical University, Harbin 150086, China.
Bioinformatics. 2006 Dec 1;22(23):2883-9. doi: 10.1093/bioinformatics/btl339. Epub 2006 Jun 29.
Microarray datasets frequently contain a large number of missing values (MVs), which need to be estimated and replaced before subsequent data mining. This paper studies the effects of different MV treatments for cDNA microarray data on disease classification analysis.
By analyzing five datasets, we demonstrate that, among the three kinds of classifiers evaluated in this study, support vector machine (SVM) classifiers are robust to varied MV imputation methods [e.g. replacing MVs by zero, the K nearest-neighbor (KNN) imputation algorithm, local least squares imputation and Bayesian principal component analysis], while classification and regression tree classifiers are sensitive to them in terms of classification accuracy. KNN classifiers built on differentially expressed genes (DEGs) are robust to the varied MV treatments, but the performance of KNN classifiers based on all measured genes can deteriorate significantly when MVs are imputed for genes with a larger missing rate (MR) (e.g. MR > 5%). Generally, while replacing MVs by zero performs relatively poorly, the other imputation algorithms differ little in their effect on the classification performance of the SVM or KNN classifiers. We further demonstrate the power and feasibility of our recently proposed functional expression profile (FEP) approach as a means of handling microarray data with MVs. The FEPs are derived from functional modules that are enriched with sets of DEGs and can therefore be consistently identified under varied MV treatments; they achieve precise disease classification with better biological interpretation. We conclude that the choice of MV treatment should be determined in the context of the approach later used for disease classification. The suggested exclusion criterion of ignoring genes with a larger MR (e.g. >5%), while justifiable for some classifiers such as KNN classifiers, should not be treated as a general rule for all classifiers.
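As an illustration of the MV treatments named in the abstract, the following minimal Python sketch shows a missing-rate exclusion filter, zero replacement, and a simplified KNN imputation on a toy gene-by-sample matrix. It is not the authors' implementation: the function names, the use of `None` to mark an MV, the root-mean-square distance over co-observed samples, the unweighted neighbor average, and the toy values are all assumptions made for this example.

```python
import math


def missing_rate(row):
    """Fraction of missing (None) entries in one gene's expression row."""
    return sum(v is None for v in row) / len(row)


def filter_by_missing_rate(matrix, max_mr=0.05):
    """Keep only genes whose missing rate does not exceed max_mr."""
    return [row for row in matrix if missing_rate(row) <= max_mr]


def impute_zero(matrix):
    """Replace every missing value with zero."""
    return [[0.0 if v is None else v for v in row] for row in matrix]


def _distance(a, b):
    """Root-mean-square difference over samples observed in both genes."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared samples: treat as infinitely far apart
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))


def impute_knn(matrix, k=2):
    """Fill each missing entry with the unweighted mean of that sample's
    value in the k nearest genes that have it observed (simplified variant,
    assumed for illustration)."""
    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = sorted(
                (_distance(row, other), other[j])
                for n, other in enumerate(matrix)
                if n != i and other[j] is not None
            )
            neighbors = [v for d, v in candidates[:k] if d != math.inf]
            # Fall back to zero when no usable neighbor exists.
            result[i][j] = sum(neighbors) / len(neighbors) if neighbors else 0.0
    return result


# Toy matrix: rows are genes, columns are samples, None marks an MV.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, None, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [-2.0, -1.0, None, None],
]

kept = filter_by_missing_rate(genes, max_mr=0.05)  # MR > 5% exclusion rule
zero_filled = impute_zero(genes)
knn_filled = impute_knn(genes, k=2)
```

Each treated matrix could then be passed to a classifier of choice to compare accuracy across treatments, which is the kind of comparison the abstract describes. With only four samples per gene, the 5% threshold excludes every gene that has any MV, so in practice the threshold only becomes informative on datasets with many more samples.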