Detecting the impact of subject characteristics on machine learning-based diagnostic applications.

Author information

Chaibub Neto Elias, Pratap Abhishek, Perumal Thanneer M, Tummalacherla Meghasyam, Snyder Phil, Bot Brian M, Trister Andrew D, Friend Stephen H, Mangravite Lara, Omberg Larsson

Affiliations

1. Sage Bionetworks, Seattle, USA.

2. Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, USA.

Publication information

NPJ Digit Med. 2019 Oct 11;2:99. doi: 10.1038/s41746-019-0178-x. eCollection 2019.

DOI:10.1038/s41746-019-0178-x

PMID:31633058

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6789029/

Abstract

Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications, including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets ("record-wise" data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of "identity confounding." In essence, these models learn to identify subjects in addition to learning the diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided.
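
The difference between a record-wise and a subject-wise split can be illustrated with a minimal sketch. This is not the authors' code or their confounding-quantification method, which the abstract does not detail. The function names, the toy data (10 hypothetical subjects with 20 repeated records each), and the 25% test fraction are all assumptions made for illustration.

```python
import random


def record_wise_split(records, test_fraction=0.25, seed=0):
    """Shuffle individual records, ignoring which subject produced them."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def subject_wise_split(records, test_fraction=0.25, seed=0):
    """Assign every record of a given subject to the same side of the split."""
    rng = random.Random(seed)
    subjects = sorted({r["subject"] for r in records})
    rng.shuffle(subjects)
    n_test = max(1, int(round(len(subjects) * test_fraction)))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r["subject"] not in test_subjects]
    test = [r for r in records if r["subject"] in test_subjects]
    return train, test


def shared_subjects(train, test):
    """Subjects that contribute records to both the training and the test set."""
    return {r["subject"] for r in train} & {r["subject"] for r in test}


# Toy data (hypothetical): 10 subjects, 20 repeated measurements each.
records = [{"subject": s, "record": i} for s in range(10) for i in range(20)]

rw_train, rw_test = record_wise_split(records)
sw_train, sw_test = subject_wise_split(records)

# Record-wise: some subjects appear on both sides, so a classifier can
# score well on the test set simply by recognizing them.
print("record-wise shared subjects:", len(shared_subjects(rw_train, rw_test)))

# Subject-wise: no subject appears on both sides, by construction.
print("subject-wise shared subjects:", len(shared_subjects(sw_train, sw_test)))
```

A non-empty overlap under the record-wise split is the precondition for identity confounding as described in the abstract. The subject-wise split removes that overlap, so test performance then reflects generalization to unseen individuals.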
