Present address: National Centre for Epidemiology & Population Health, Australian National University, Canberra, ACT 2601, Australia.
Pattern Recognition & Pathology, Department of Genome Sciences, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia.
BMC Med Inform Decis Mak. 2017 Aug 14;17(1):121. doi: 10.1186/s12911-017-0522-5.
Data mining techniques such as support vector machines (SVMs) have been successfully used to predict outcomes for complex problems, including for human health. Much health data is imbalanced, with many more controls than positive cases.
The impact of three balancing methods and one feature selection method on the ability of SVMs to classify imbalanced diagnostic pathology data was explored, using data associated with the laboratory diagnosis of hepatitis B (HBV) and hepatitis C (HCV) infections. Random forests (RFs) were examined for predictor variable selection, and data reshaping was examined as a way to overcome the large excess of negative over positive HBV and HCV immunoassay results. The methodology is illustrated using data from ACT Pathology (Canberra, Australia), consisting of laboratory test records from 18,625 individuals who underwent hepatitis virus testing over the decade from 1997 to 2007.
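The data reshaping ideas named above (downsizing the negative class, repeating that draw several times, and SMOTE-style oversampling of the positive class) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-feature toy data, the class sizes and the choice of `k=5` are all assumptions made for illustration, and the abstract does not say how the multiple downsized subsets were combined.

```python
import math
import random


def downsize(majority, minority, rng):
    """Single downsizing: keep every minority row and draw an equal
    number of majority rows at random, without replacement."""
    return rng.sample(majority, len(minority)), list(minority)


def multiple_downsize(majority, minority, n_subsets, rng):
    """Multiple downsizing (assumed form): repeat the random draw to get
    several balanced subsets, e.g. to train one classifier per subset."""
    return [downsize(majority, minority, rng) for _ in range(n_subsets)]


def smote(minority, n_synthetic, k, rng):
    """SMOTE-style oversampling: each synthetic row lies on the segment
    between a minority row and one of its k nearest minority neighbours.
    Requires at least two minority rows."""
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to the base row.
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 200 negatives and 10 positives with two features.
rng = random.Random(0)
negatives = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
positives = [(rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)) for _ in range(10)]

neg_sample, pos_sample = downsize(negatives, positives, rng)
subsets = multiple_downsize(negatives, positives, 5, rng)
new_positives = smote(positives, len(negatives) - len(positives), k=5, rng=rng)
balanced_positives = positives + new_positives
```

Downsizing discards most of the negative records, whereas the SMOTE-style step keeps them all and fills in the positive class by interpolation; either balanced set could then be passed to an SVM.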
Overall, HCV immunoassay results were predicted more accurately than HBV immunoassay results from identical routine pathology predictor variables. HBV and HCV negative results were vastly in excess of positive results, so three approaches to handling the negative/positive data imbalance were compared. Generating datasets by the Synthetic Minority Oversampling Technique (SMOTE) resulted in significantly more accurate prediction than single downsizing or multiple downsizing (MDS) of the dataset. For downsized datasets, applying an RF for predictor variable selection had a small effect on performance, which varied depending on the virus. For SMOTE, applying an RF had a negative effect on performance. An analysis of variance of the performance across settings supports these findings. Finally, age and alanine aminotransferase (ALT) assay results, along with sodium for HBV and urea for HCV, were found to have a significant impact upon laboratory diagnosis of HBV or HCV infection using an optimised SVM model.
Laboratories looking to include machine learning via SVM as part of their decision support need to be aware that the balancing method, predictor variable selection and virus type interact: their effect on the laboratory diagnosis of hepatitis virus infection from routine pathology laboratory variables differs depending on which combination is studied. This awareness should lead to careful use of existing machine learning methods, thus improving the quality of laboratory diagnosis.