Sasse Julia, Fabre Guillaume, Fortier Isabel, Zimmermann Pierre, Fluck Juliane
ZB MED - Information Centre for Life Sciences, Cologne, Germany, https://ror.org/0259fwx54.
Maelstrom Research, Research Institute of the McGill University Health Centre, Montreal, Canada.
Stud Health Technol Inform. 2025 May 15;327:848-852. doi: 10.3233/SHTI250479.
The significance of Findable, Accessible, Interoperable, and Reusable (FAIR) data is increasing, particularly as a means of enhancing data reuse in research. The National Research Data Infrastructure for Personal Health Data (NFDI4Health) aims to enhance the findability, reusability, and interoperability of health data derived from epidemiological, clinical, and public health studies. NFDI4Health has established the German Central Health Study Hub to improve health data findability through rich metadata. The Maelstrom Catalog, provided by Maelstrom Research, offers a comprehensive dataset of labeled and harmonized study variables, thereby enhancing the findability and reusability of epidemiological data. Both platforms rely on standardized categorization to optimize data reuse. To facilitate this process, NFDI4Health developed the Metadata Annotation Workbench, which supports metadata annotation with standardized vocabulary. This paper presents an AI solution for automatic classification and annotation integrated into this service, using a BioBERT-based text classifier. The model achieved a weighted F1-score of over 92% and demonstrated improved annotation performance, particularly for non-experts. It accelerates the categorization of study variables, thereby enhancing data findability and reuse. We are confident that the further development of such AI approaches will reduce curatorial workload and promote semantically annotated, interoperable data catalogs.