The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China.
20/20 GeneSystems, Inc, Gaithersburg, MD 20877, USA; Department of Laboratory Medicine, Chang Gung Memorial Hospital at Linkou, Taoyuan City, 33305, Taiwan; PhD Program in Biomedical Engineering, Chang Gung University, Taoyuan City, 33301, Taiwan.
Comput Biol Med. 2022 May;144:105362. doi: 10.1016/j.compbiomed.2022.105362. Epub 2022 Mar 9.
Machine learning (ML) has emerged as a powerful method for analyzing large datasets. Its application is often hindered by incomplete data, a problem that is particularly evident in disease screening data because testing regimens vary across medical institutions. Here we explored the utility of multiple ML algorithms for predicting cancer risk when trained on a large but incomplete real-world dataset of tumor marker (TM) values.
TM screening data were collected from a large asymptomatic cohort (n = 163,174) at two independent medical centers. The cohort included 785 individuals who were subsequently diagnosed with cancer. Data included levels of up to eight TMs, but for most subjects only a subset of the biomarkers was tested. In some instances, TM values were available at multiple time points, but the intervals between tests varied widely. The data were used to train and test various ML models and to evaluate their robustness in predicting cancer risk. Multiple data imputation methods were explored, and models were developed for both single time-point and time-series data.
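As a rough illustration of the kind of preprocessing such data call for, the sketch below mean-imputes untested markers while keeping a "was measured" indicator, and pads a subject's visits to a fixed length with the gap in days since the previous visit as an extra feature. This is an assumption-laden example, not the imputation or sequence encoding used in the study; the function names `impute_with_mask` and `pad_visits` and the choice of mean imputation are hypothetical.

```python
import numpy as np

def impute_with_mask(X):
    """Mean-impute NaNs column-wise and append a 0/1 'was measured' mask.

    Mean imputation is an illustrative choice only; the study compared
    several imputation methods that are not specified here.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    # Column means over observed values; 0.0 if a marker was never measured.
    counts = observed.sum(axis=0)
    sums = np.where(observed, X, 0.0).sum(axis=0)
    col_means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    filled = np.where(observed, X, col_means)
    return np.hstack([filled, observed.astype(float)])

def pad_visits(visits, days, max_len=4):
    """Left-pad one subject's visits to max_len time steps.

    visits: (n_visits, n_features) array with no NaNs (already imputed).
    days:   day of each visit, in increasing order.
    Returns the padded array (with the gap in days since the previous
    visit appended as the last feature) and a boolean mask of real steps.
    """
    visits = np.asarray(visits, dtype=float)[-max_len:]
    days = np.asarray(days, dtype=float)[-max_len:]
    gaps = np.diff(days, prepend=days[0])  # first kept visit has gap 0
    steps = np.hstack([visits, gaps[:, None]])
    padded = np.zeros((max_len, steps.shape[1]))
    valid = np.zeros(max_len, dtype=bool)
    padded[max_len - len(steps):] = steps
    valid[max_len - len(steps):] = True
    return padded, valid
```

Keeping the mask and the time gaps as explicit features is one common way to let a sequence model see which values were actually measured and how far apart the tests were, rather than hiding that information behind the imputed values.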
Long short-term memory (LSTM), an ML algorithm, outperformed the other models in handling irregular medical data. A cancer risk prediction tool was trained and validated for a single time-point test of a TM panel including up to four biomarkers (AUROC = 0.831, 95% CI: 0.827-0.835); it outperformed a single-threshold method using the same biomarkers. A second model, relying on time-series data of up to four time points for five TMs, achieved an AUROC of 0.931.
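For readers unfamiliar with the metric, the sketch below shows one way an AUROC and an approximate 95% confidence interval can be computed from predicted risk scores. It assumes a percentile bootstrap for the interval; the abstract does not state how its interval was obtained, and the function names `auroc` and `bootstrap_ci` are illustrative.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC as the probability that a random case scores higher than a
    random non-case, with ties counted as one half."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    wins = 0.0
    for p in pos:  # loop over cases to avoid a large pairwise matrix
        wins += (neg < p).sum() + 0.5 * (neg == p).sum()
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (an assumed method)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples that contain only one class
        stats.append(auroc(y_true[idx], scores[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Toy example: 3 of the 4 case/non-case pairs are ranked correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because the AUROC summarizes ranking quality across all possible cut-offs, it allows a risk score to be compared with a single-threshold rule, which corresponds to only one operating point on the ROC curve.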
A cancer risk prediction tool was developed by training an LSTM model on a large but incomplete real-world dataset of TM values. The LSTM model handled irregular data better than the other ML models. The use of time-series TM data can further improve the predictive performance of LSTM models, even when the intervals between tests vary widely. These risk prediction tools are useful for directing subjects to further screening sooner, resulting in earlier detection of occult tumors.