Mao Lingchao, Wang Hairong, Hu Leland S, Tran Nhan L, Canoll Peter D, Swanson Kristin R, Li Jing
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA.
Department of Radiology and the Mathematical Neuro-Oncology Laboratory, Department of Neurosurgery, Mayo Clinic Arizona, Phoenix, AZ 85054 USA.
IEEE Trans Autom Sci Eng. 2025;22:10008-10028. doi: 10.1109/tase.2024.3515839. Epub 2024 Dec 18.
Cancer remains one of the most challenging diseases to treat. Machine learning (ML) has enabled in-depth analysis of complex patterns from large, diverse datasets, greatly facilitating "healthcare automation" in cancer diagnosis and prognosis. Despite these advancements, ML models face challenges stemming from limited labeled sample sizes, the intricate interplay of high-dimensional data types, the inherent heterogeneity observed among patients and within tumors, and concerns about interpretability and consistency with existing biomedical knowledge. One approach to addressing these challenges is to integrate biomedical knowledge into data-driven models, which has shown potential to improve the accuracy, robustness, and interpretability of model results. Here, we review state-of-the-art ML studies that fuse biomedical knowledge with data, an approach termed knowledge-informed machine learning (KIML), to advance cancer diagnosis and prognosis. We provide an overview of the diverse forms of knowledge representation and of current strategies for integrating knowledge into machine learning pipelines, with concrete examples. We conclude by discussing future directions for leveraging KIML to advance cancer research and healthcare automation. A live summary of the review is hosted at https://lingchm.github.io/kinformed-machine-learning-cancer/, offering an evolving resource to support research in this field.