Adabor Emmanuel S, Acquaah-Mensah George K, Mazandu Gaston K
School of Technology, Ghana Institute of Management and Public Administration, Accra, Ghana.
Pharmaceutical Sciences Department, Massachusetts College of Pharmacy and Health Sciences, Worcester, MA, USA.
F1000Res. 2020 Sep 10;9:1114. doi: 10.12688/f1000research.25501.1. eCollection 2020.
High-throughput technologies have resulted in an exponential growth of publicly available and accessible datasets for biomedical research. Efficient computational models, algorithms and tools are required to exploit these datasets for knowledge discovery to aid medical decisions. Here, we introduce a new tool, MSclassifier, based on median-supplement approaches to machine learning, to enable automated and effective binary classification for optimal decision making. The MSclassifier package estimates medians of features (attributes) to deduce supplementary data, which is subsequently introduced into the training set to balance it and to build superior classification models. To test our approach, we use it to determine HER2 receptor expression status phenotypes in breast cancer and to predict protein subcellular localization (plasma membrane and nucleus). Using independent sample and cross-validation tests, we evaluate the performance of MSclassifier and compare it with well-established tools that can perform such tasks. In the HER2 receptor expression status phenotype identification tasks, MSclassifier achieved a statistically significantly higher classification rate than the best-performing existing tool (90.30% versus 89.83%, p=8.62e-3). In the subcellular localization prediction tasks, MSclassifier and one other existing tool achieved comparably high performance (93.42% versus 93.19%, p=0.06), and both outperformed tools based on Naive Bayes classifiers. Overall, the application and evaluation of MSclassifier reveal its potential to be applied to a variety of binary classification problems. The MSclassifier package provides an R-portable and user-friendly application to a broad audience, enabling experienced end-users as well as non-programmers to perform effective classification in biomedical and other fields of study.
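The median-supplement idea described above can be illustrated with a minimal sketch. It is not the MSclassifier implementation: the abstract does not specify how the supplementary data are deduced, so the sketch assumes the simplest reading, in which the feature-wise median of the minority class is appended to the training set until the two classes are the same size. The function name `median_supplement` and the toy data are hypothetical, and Python is used here only for illustration (the package itself is written for R).

```python
from collections import Counter
from statistics import median


def median_supplement(X, y):
    """Balance a binary training set with feature-wise median rows.

    Assumption (not taken from the MSclassifier source): the supplementary
    sample is the per-feature median of the minority class, repeated until
    both classes have the same number of samples.
    """
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")

    # most_common(2) lists the larger class first, then the smaller one.
    (_, n_major), (minor, n_minor) = counts.most_common(2)

    # Collect minority-class rows and take the median of each feature column.
    minor_rows = [row for row, label in zip(X, y) if label == minor]
    median_row = [median(column) for column in zip(*minor_rows)]

    # Append as many median rows as needed to equalise the class sizes.
    n_extra = n_major - n_minor
    X_aug = [list(row) for row in X] + [list(median_row) for _ in range(n_extra)]
    y_aug = list(y) + [minor] * n_extra
    return X_aug, y_aug


# Toy data: four "neg" samples and two "pos" samples, two features each.
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [4.0, 13.0], [8.0, 1.0], [10.0, 3.0]]
y = ["neg", "neg", "neg", "neg", "pos", "pos"]

X_aug, y_aug = median_supplement(X, y)
print(len(X_aug))   # 8
print(X_aug[-1])    # [9.0, 2.0]
```

The augmented set can then be passed to any binary classifier. Whether the package supplements one class or both, and whether it adds a single median sample or several, is left open by the abstract, so those choices here are assumptions.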