Li Jinpeng, Tao Yaling, Cong Huaiwei, Zhu Enwei, Cai Ting
Ningbo HwaMei Hospital, University of Chinese Academy of Sciences, Ningbo, Zhejiang 315010, China; Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, Zhejiang 315010, China.
Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, Zhejiang 315010, China.
Artif Intell Med. 2022 Feb;124:102234. doi: 10.1016/j.artmed.2021.102234. Epub 2022 Jan 6.
Liver cancer is a threat to human health and life worldwide. The key to reducing liver cancer incidence is to identify high-risk populations and carry out individualized interventions before cancer occurs. Building predictive models with machine learning algorithms is an effective and economical way to forecast potential liver cancers. However, because such datasets are usually extremely skewed (negative samples far outnumber positive samples), machine learning models suffer from severe bias and make unreliable predictions. In this paper, we systematically evaluate existing approaches to the class-imbalance problem and introduce two undersampling methods. The first is based on K-means++, where robust cluster centers are used as the negative samples. The second is based on learning vector quantization, which takes diagnostic labels into account during clustering and uses the resulting prototypes as the negative data. In this way, positive and negative samples are rebalanced. The algorithm is applied to five-year liver cancer prediction in the Early Diagnosis and Treatment of Urban Cancer project in China. We achieve an AUC of 0.76 using only epidemiological information and no clinical measurements. Experimental results show the advantage of our method over existing oversampling, undersampling, and ensemble algorithms, as well as state-of-the-art outlier detection algorithms. This work explores a feasible and practical roadmap for handling skewed medical data in cancer prediction and benefits applications targeting human health and well-being.
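The clustering-based undersampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmeans_pp_centers`, `undersample_negatives`), the choice of as many clusters as there are positive samples, and the use of plain cluster means in place of the "robust" centers mentioned in the abstract are all assumptions made for illustration. Labels are assumed to be 1 for positive and 0 for negative.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_centers(points, k, n_iter=20, seed=0):
    """Cluster `points` into k groups (k-means++ seeding, then Lloyd
    iterations) and return the k cluster centers."""
    rng = random.Random(seed)
    # k-means++ seeding: the first center is drawn uniformly; each later
    # center is drawn with probability proportional to its squared
    # distance from the nearest center chosen so far.
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point already coincides with a center
            centers.append(list(rng.choice(points)))
        else:
            centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    # Lloyd iterations: assign each point to its nearest center, then
    # move each center to the mean of its assigned points.
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)  # an empty cluster keeps its center
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def undersample_negatives(X, y, seed=0):
    """Replace the negative class by cluster centers so that both classes
    end up the same size (assumption: one cluster per positive sample)."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    k = min(len(pos), len(neg))
    centers = kmeans_pp_centers(neg, k, seed=seed)
    X_bal = pos + centers
    y_bal = [1] * len(pos) + [0] * len(centers)
    return X_bal, y_bal
```

Calling `undersample_negatives(X, y)` on a dataset with, say, 200 negatives and 10 positives would return 20 samples, 10 per class, which could then be passed to any classifier. Note that the negative samples returned are synthetic centroids rather than original records.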