Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data.

Affiliation

i-ICT, University Hospital Antwerp, Wilrijkstraat 10, Edegem, Belgium.

Publication information

J Biomed Inform. 2011 Apr;44(2):319-25. doi: 10.1016/j.jbi.2010.12.001. Epub 2010 Dec 15.

Abstract

Newborn screening programs for severe metabolic disorders using tandem mass spectrometry are widely used. Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) is the most prevalent mitochondrial fatty acid oxidation defect (1:15,000 newborns), and early detection of this metabolic disease has been shown to decrease mortality and improve outcome. In previous studies, data mining methods applied to derivatized tandem MS datasets have shown high classification accuracies. However, no machine learning methods have yet been applied to datasets based on non-derivatized screening methods. A dataset of 44,159 blood samples was collected using a non-derivatized screening method as part of systematic newborn screening by the PCMA screening center (Belgium). Twelve MCADD cases were present in this partially MCADD-enriched dataset. We extended three data mining methods, namely C4.5 decision trees, logistic regression and ridge logistic regression, with a parameter and threshold optimization method and evaluated their applicability as a diagnostic support tool. Within a stratified cross-validation setting, a grid search was performed for each model over a wide range of model parameters, included variables and classification thresholds. The best performing model used ridge logistic regression and achieved, within that cross-validation setting, a sensitivity of 100%, a specificity of 99.987% and a positive predictive value of 32% (recalibrated for a real population). These results were further validated on an independent test set. Using a method that combines ridge logistic regression with variable selection and threshold optimization, a significantly improved performance was achieved compared to the current state of the art for derivatized data, while retaining more interpretability and requiring fewer variables. The results indicate the potential value of data mining methods as a diagnostic support tool.
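
The abstract reports a positive predictive value "recalibrated for a real population" but does not spell out the calculation. The sketch below shows one common way to do such a recalibration, using Bayes' rule with sensitivity, specificity and prevalence; it is an assumption that the authors did something similar. The function name `recalibrated_ppv` is hypothetical, and the inputs are the rounded figures quoted in the abstract, so the result is only expected to land in the same range as the reported 32%, not to reproduce it.

```python
def recalibrated_ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value at a given prevalence, via Bayes' rule.

    Hypothetical helper for illustration; not taken from the paper.
    """
    true_pos = sensitivity * prevalence                    # P(test+, disease)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(test+, no disease)
    return true_pos / (true_pos + false_pos)


# Rounded figures from the abstract.
SENSITIVITY = 1.0
SPECIFICITY = 0.99987

# Prevalence in the general population (1:15,000 newborns).
ppv_population = recalibrated_ppv(SENSITIVITY, SPECIFICITY, 1 / 15000)

# Prevalence in the partially MCADD-enriched dataset (12 cases in 44,159 samples).
# Assumption: the whole dataset is used here purely to illustrate the effect of enrichment.
ppv_enriched = recalibrated_ppv(SENSITIVITY, SPECIFICITY, 12 / 44159)

print(f"PPV at population prevalence: {ppv_population:.3f}")
print(f"PPV at enriched prevalence:   {ppv_enriched:.3f}")
```

With these rounded inputs the population-level value comes out at roughly a third, while the enriched prevalence gives a visibly higher value. This is why a PPV measured on an enriched dataset needs recalibrating before it is read as a screening-program figure.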
