A new AI-assisted data standard accelerates interoperability in biomedical research.

Authors

Long Rodney Alan, Ballard Shannon, Shah Syed, Bianchi Owen, Jones Lietsel, Koretsky Mathew J, Kuznetsov Nicole, Marsan Elise, Jen Bryant, Chiang Philip, Mukherjee Abhradeep, Blauwendraat Cornelis, Leonard Hampton, Vitale Dan, Levine Kristin, Bandres-Ciga Sara, Jarreau Paige, Brannelly Patrick, Pantazis Caroline, Screven Laurel, Andersh Kate, Kapasi Alifiya, Crary John F, Gutman David, Dugger Brittany N, Biber Sarah, Hohman Timothy, Faghri Faraz, Griswold Michael, Sargent Lana, van Keuren-Jensen Kendall, Singleton Andrew B, Fann Yang, Nalls Mike A, Iwaki Hirotaka

Affiliations

Center for Alzheimer's and Related Dementias, National Institute on Aging, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD, USA.

Data Tecnica LLC, Washington, DC, USA.

Publication information

medRxiv. 2024 Nov 7:2024.10.17.24315618. doi: 10.1101/2024.10.17.24315618.

Abstract

In this paper, we leveraged Large Language Models (LLMs) to accelerate data wrangling and automate labor-intensive aspects of data discovery and harmonization. This work promotes interoperability standards and enhances data discovery, facilitating AI-readiness in biomedical science through the generation of Common Data Elements (CDEs), which are key to harmonizing multiple datasets. Thirty-one studies, various ontologies, and medical coding systems served as source material for creating CDEs; the available metadata and context were sent as API requests to 4th-generation OpenAI GPT models to populate each metadata field. A human-in-the-loop (HITL) approach was used to assess the quality and accuracy of the generated CDEs. To regulate CDE generation, we employed ElasticSearch and HITL review to avoid creating duplicate CDEs, instead adding candidate duplicates as potential aliases for existing CDEs. The generated CDEs are foundational for assessing the interoperability potential of datasets: we determine how many dataset column headers can be correctly mapped to CDEs and quantify compliance with permissible values and data types. Subject matter experts reviewed the generated CDEs and determined that 94.0% of generated metadata fields did not require manual revision. Data tables from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Global Parkinson's Genetics Program (GP2) were used as test cases for interoperability assessments. Across all test cases, 32.4% of column headers were successfully mapped to generated CDEs via ElasticSearch. The interoperability score, a metric of a dataset's compatibility with CDEs and other connected datasets based on criteria such as data field completeness and compliance with common harmonization standards, averaged 53.8 out of 100 for the test cases. With this project, we aim to automate the most tedious aspects of data harmonization, enhancing efficiency and scalability in biomedical research while lowering the activation energy for federated research.
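
To make the header-mapping and compliance checks described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It substitutes Python's standard-library `difflib` string similarity for ElasticSearch, and the CDE names, aliases, permissible values, threshold, and function names (`best_cde`, `mapping_rate`, `value_compliance`) are all hypothetical. The abstract does not specify how the interoperability score is computed, so this sketch does not attempt to reproduce it.

```python
from difflib import SequenceMatcher

# Hypothetical CDE catalog: every name, alias, and permissible value below
# is invented for illustration and is not taken from the paper.
CDES = {
    "age_at_baseline": {"aliases": ["age", "baseline_age"], "permissible": None},
    "sex": {
        "aliases": ["gender", "biological_sex"],
        "permissible": {"Male", "Female", "Unknown"},
    },
    "diagnosis": {
        "aliases": ["dx", "primary_diagnosis"],
        "permissible": {"PD", "AD", "Control"},
    },
}


def normalize(text):
    """Lowercase a header and unify separators so comparisons are fair."""
    return text.strip().lower().replace(" ", "_").replace("-", "_")


def best_cde(header, threshold=0.8):
    """Return the CDE whose name or alias is most similar to the header.

    Similarity is difflib's ratio (a stand-in for a search-engine score).
    Returns None when no candidate reaches the assumed threshold.
    """
    target = normalize(header)
    best_name, best_score = None, 0.0
    for name, spec in CDES.items():
        for candidate in [name, *spec["aliases"]]:
            score = SequenceMatcher(None, target, normalize(candidate)).ratio()
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= threshold else None


def mapping_rate(headers):
    """Map each header to a CDE (or None) and return the percent mapped."""
    mapped = {h: best_cde(h) for h in headers}
    if not headers:
        return mapped, 0.0
    hits = sum(1 for cde in mapped.values() if cde is not None)
    return mapped, 100.0 * hits / len(headers)


def value_compliance(cde_name, values):
    """Fraction of values that fall within the CDE's permissible values."""
    allowed = CDES[cde_name]["permissible"]
    if allowed is None or not values:
        return 1.0  # no restriction defined, or nothing to check
    return sum(1 for v in values if v in allowed) / len(values)


if __name__ == "__main__":
    headers = ["Gender", "DX", "visit_code", "Baseline Age"]
    mapped, rate = mapping_rate(headers)
    print(mapped, rate)
    print(value_compliance("sex", ["Male", "F", "Female", "Unknown"]))
```

In this toy example three of the four headers match an alias exactly and `visit_code` stays unmapped, so the mapping rate is 75.0; one of the four `sex` values (`"F"`) falls outside the permissible set, so compliance is 0.75. Treating near-duplicates as aliases of an existing entry, rather than as new entries, mirrors the alias strategy the abstract describes, although in the paper that decision also passes through human review.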

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0708/11562160/6e99f438ee09/nihpp-2024.10.17.24315618v2-f0001.jpg
