Li Haomin, Gao Siyuan, Wu Dan, Zhu Min, Hu Zhenzhen, Fang Kexin, Chen Xiuru, Ni Zhou, Li Jing, Zhao Beibei, She Xuhui, Huang Xinwen
Clinical Data Center, the Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China.
Digital Management Center, Guangzhou KingMed Diagnostics Group Co., Ltd., Guangzhou 510005, China.
Int J Med Inform. 2025 Mar;195:105765. doi: 10.1016/j.ijmedinf.2024.105765. Epub 2024 Dec 16.
Gas chromatography-mass spectrometry (GC-MS) has been shown to be a potentially efficient metabolic profiling platform for urine analysis. However, the widespread use of GC-MS for inborn errors of metabolism (IEM) screening is constrained by the rarity of IEM in the population and by the difficulty of interpreting GC-MS organic acid profiles, which requires specialized expertise.
Based on 355,197 GC-MS test cases accumulated from 2013 to 2021 in China, a random forest-based machine learning model was proposed, trained, and evaluated. Weighted undersampling or oversampling data processing and staged modeling strategies were used to handle the highly imbalanced data and improve the ability of the model to identify different types of rare IEM cases.
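The resampling idea can be illustrated with a minimal sketch. It assumes plain random undersampling and oversampling to a fixed per-class target, which is a simplification and not necessarily the weighted scheme used in the study; the function name `resample_balanced`, the `target_per_class` parameter, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_balanced(samples, labels, target_per_class, seed=0):
    """Return a resampled dataset with exactly target_per_class rows per class.

    Hypothetical helper: large classes are randomly undersampled, small
    classes are randomly oversampled (with replacement).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    out_x, out_y = [], []
    for y, rows in by_class.items():
        if len(rows) >= target_per_class:
            # Undersample: draw without replacement from the majority class.
            chosen = rng.sample(rows, target_per_class)
        else:
            # Oversample: keep every original row, then top up with repeats.
            chosen = rows + rng.choices(rows, k=target_per_class - len(rows))
        out_x.extend(chosen)
        out_y.extend([y] * target_per_class)
    return out_x, out_y


# Toy data (not from the study): 990 negative cases and 10 positive cases.
samples = list(range(1000))
labels = ["negative"] * 990 + ["positive"] * 10

balanced_x, balanced_y = resample_balanced(samples, labels, target_per_class=50)
balanced_counts = Counter(balanced_y)
```

After resampling, both classes contribute the same number of rows to training, so a classifier such as a random forest is less likely to ignore the rare class.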
In the first-stage model, which only identified positive cases without discriminating the specific IEM, the screening sensitivity was 0.938 (or 0.991 if abnormal cases were also included). The second-stage models, which classify 11 particular IEMs, had an average sensitivity of 0.992, with an average specificity and accuracy of 0.944 and 0.969, respectively. The SHAP values visualized for each model explain the basis for the differential diagnosis made by the model.
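For reference, the three reported metrics follow the standard confusion-matrix definitions. The sketch below computes them from counts of true and false positives and negatives; the function name `screening_metrics` is hypothetical and the counts are invented for illustration, not taken from the study.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # share of true cases that are detected
    specificity = tn / (tn + fp)                 # share of non-cases correctly cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all predictions that are correct
    return sensitivity, specificity, accuracy


# Illustrative counts only (assumed, not study data).
sens, spec, acc = screening_metrics(tp=90, fp=50, tn=950, fn=10)
```

In a screening setting, sensitivity is usually the priority because a missed case is more costly than a false alarm that is resolved by follow-up testing.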
With sufficient high-quality data, machine learning models can provide high-sensitivity GC-MS interpretation and greatly improve the efficiency and quality of GC-MS based IEM screening.