Zou Meng, Zhang Peng-Jun, Wen Xin-Yu, Chen Luonan, Tian Ya-Ping, Wang Yong
National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China.
Department of Clinical Biochemistry, State Key Laboratory of Kidney Disease, Chinese PLA General Hospital, Beijing 100853, China.
Methods. 2015 Jul 15;83:3-17. doi: 10.1016/j.ymeth.2015.05.011. Epub 2015 May 15.
Multi-biomarker panels can capture the nonlinear synergy among biomarkers, and they are important aids in the early diagnosis of, and ultimately the battle against, complex diseases. However, identifying these multi-biomarker panels from case and control data is challenging. For example, exhaustive search is computationally infeasible when the data dimension is high. Here, we propose a novel method, MILP_k, to identify a serum-based multi-biomarker panel that distinguishes colorectal cancers (CRC) from benign colorectal tumors. Specifically, the multi-biomarker panel detection problem is modeled as a mixed integer program that maximizes classification accuracy. We then measured serum profiling data for 101 CRC patients and 95 patients with benign tumors. The 61 biomarkers were analyzed individually, and their combinations were then analyzed by our method. We discovered 4 biomarkers as the optimal small multi-biomarker panel, including the known CRC biomarkers CEA and IL-10 as well as the novel biomarkers IMA and NSE. This multi-biomarker panel achieves a leave-one-out cross-validation (LOOCV) accuracy of 0.7857 with a nearest centroid classifier. An independent test of this panel by support vector machine (SVM) with threefold cross-validation yields an AUC of 0.8438. This improves the predictive accuracy by 20% over the single best biomarker. Extending this 4-biomarker panel to a larger 13-biomarker panel improves the LOOCV accuracy to 0.8673, with an independent-test AUC of 0.8437. Comparison with the exhaustive search method shows that our method reduces the search time by 1000-fold. Experiments on the early cancer stage samples reveal two panels of biomarkers and show promising accuracy. Given the number of biomarkers to select, the proposed method allows us to choose the subset of biomarkers with the best accuracy in distinguishing case and control samples. Both the receiver operating characteristic curve and the precision-recall curve show our method's consistent gain in accuracy. 
Our method also shows an advantage in capturing synergy among the selected biomarkers. The multi-biomarker panel far outperforms the simple combination of the best single features. Close investigation of the multi-biomarker panel illustrates that our method can remove redundancy and reveal complementary biomarker combinations. In addition, our method is efficient and can select multi-biomarker panels with more than 5 biomarkers, for which exhaustive methods fail. In conclusion, we propose a promising model to improve clinical data interpretability and to serve as a useful tool for studies of other complex diseases. Our small multi-biomarker panel, CEA, IL-10, IMA, and NSE, may provide insights into the disease status of colorectal diseases. The implementation of our method in MATLAB is available via the website: http://doc.aporc.org/wiki/MILP_k.
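To make the evaluation setup concrete, the following is a minimal Python sketch of the exhaustive-search baseline the abstract compares against: every size-k subset of biomarkers is scored by the LOOCV accuracy of a nearest centroid classifier. It is not the MILP_k method and not the authors' MATLAB code. The function names (`loocv_accuracy`, `exhaustive_best_panel`), the squared Euclidean distance, and the toy data are all assumptions made for illustration; the real study used 61 biomarkers measured in 196 patients.

```python
from itertools import combinations
from math import comb


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loocv_accuracy(X, y, panel):
    """LOOCV accuracy of a nearest centroid classifier using only the
    feature columns listed in `panel`."""
    correct = 0
    for i in range(len(X)):
        # Hold out sample i; compute class centroids from the remaining samples.
        train = [(X[j], y[j]) for j in range(len(X)) if j != i]
        cents = {}
        for lab in sorted({lab for _, lab in train}):
            rows = [[row[k] for k in panel] for row, l in train if l == lab]
            cents[lab] = centroid(rows)
        xi = [X[i][k] for k in panel]
        # Ties go to the smallest label because dict order follows sorted().
        pred = min(cents, key=lambda lab: sq_dist(xi, cents[lab]))
        correct += pred == y[i]
    return correct / len(X)


def exhaustive_best_panel(X, y, k):
    """Score every size-k subset of features; return the first best one."""
    n_features = len(X[0])
    return max(combinations(range(n_features), k),
               key=lambda p: loocv_accuracy(X, y, p))


# Hypothetical toy data: 6 samples, 3 features; only feature 0 separates classes.
X_toy = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.1],
    [0.9, 6.0, 0.3],
    [3.0, 5.5, 0.2],
    [3.2, 4.5, 0.1],
    [2.9, 5.0, 0.3],
]
y_toy = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print(exhaustive_best_panel(X_toy, y_toy, 1))  # (0,)
    # Number of candidate 4-biomarker panels among 61 biomarkers:
    print(comb(61, 4))  # 521855
```

The count `comb(61, 4)` shows why brute force is still possible for small panels, while the number of candidates grows rapidly with panel size. This is consistent with the abstract's remark that exhaustive methods fail for panels of more than 5 biomarkers.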