Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California.
Department of Pediatrics, Stanford University School of Medicine, Stanford, California.
JAMA Netw Open. 2022 Aug 1;5(8):e2227779. doi: 10.1001/jamanetworkopen.2022.27779.
Various model reporting guidelines have been proposed to ensure clinical prediction models are reliable and fair. However, no consensus exists about which model details are essential to report, and commonalities and differences among reporting guidelines have not been characterized. Furthermore, how well documentation of deployed models adheres to these guidelines has not been studied.
To assess information requested by model reporting guidelines and whether the documentation for commonly used machine learning models developed by a single vendor provides the information requested.
MEDLINE was queried using the terms "machine learning model card" and "reporting machine learning" from November 4 to December 6, 2020. References were reviewed to find additional publications, and publications without specific reporting recommendations were excluded. Similar reporting elements were merged into representative items. Four independent reviewers and 1 adjudicator assessed how often documentation for the most commonly used models developed by a single vendor reported the items.
From 15 model reporting guidelines, 220 unique items were identified that together represented the collective reporting requirements. Although 12 items were commonly requested (by 10 or more guidelines), 77 items were requested by only 1 guideline. Documentation for 12 commonly used models from a single vendor reported a median of 39% (IQR, 37%-43%; range, 31%-47%) of the items in the collective reporting requirements. Many of the commonly requested items had 100% reporting rates, including items concerning outcome definition, area under the receiver operating characteristic curve, internal validation, and intended clinical use. Several items related to reliability, such as external validation, uncertainty measures, and the strategy for handling missing data, were reported half the time or less. Other frequently unreported items related to fairness, such as summary statistics and subgroup analyses by race and ethnicity or sex.
These findings suggest that consistent reporting recommendations for clinical predictive models are needed so that model developers can share the information necessary for model deployment. The many published guidelines would, collectively, require reporting more than 200 items. Model documentation from 1 vendor reported the most commonly requested items from model reporting guidelines; however, areas for improvement were identified in reporting items related to model reliability and fairness. This analysis led to feedback to the vendor, which prompted updates to the documentation for future users.