

Clinical document corpora - real ones, translated and synthetic substitutes, and assorted domain proxies: a survey of diversity in corpus design, with focus on German text data.

Author information

Hahn Udo

Affiliation

Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, D-04107 Leipzig, Saxony, Germany.

Publication information

JAMIA Open. 2025 May 14;8(3):ooaf024. doi: 10.1093/jamiaopen/ooaf024. eCollection 2025 Jun.

Abstract

OBJECTIVE

We survey clinical document corpora, with a focus on German textual data. Due to rigid data privacy legislation in Germany, these resources are, with only a few exceptions, stored in protected clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing, where easy accessibility and reuse of (textual) data collections are common practice. Hence, alternative corpus designs have been examined to escape from data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several types of domain proxies have come up as substitutes for real clinical documents. Common instances of close proxies are medical journal publications, therapy guidelines, and drug labels; more distant proxies include medical content from social media channels or online encyclopedic medical articles.

METHODS

We follow the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines for surveying the field of German-language clinical/medical corpora. Four bibliographic databases were searched: PubMed, ACL Anthology, Google Scholar, and the author's personal literature database.

RESULTS

After PRISMA-conformant identification of 362 hits from the 4 bibliographic systems, the screening process yielded 78 relevant documents for inclusion in this review. Overall, they contained 92 different published versions of corpora, of which 71 were truly unique in terms of their underlying document sets. Of the 92 versions, the majority were clinical corpora: 46 real ones (32 unique), 5 translated ones (3 unique), and 6 synthetic ones (3 unique). As to domain proxies, we identified 18 close ones (16 unique) and 17 distant ones (all of them unique).
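The per-category counts reported above can be checked against the stated totals with a few lines of arithmetic. The sketch below is only a consistency check on the numbers in this paragraph; the dictionary keys and variable names are illustrative labels, not terminology fixed by the survey.

```python
# Counts taken from the RESULTS paragraph: (published versions, unique corpora).
# The category labels are illustrative names chosen for this sketch.
counts = {
    "real clinical": (46, 32),
    "translated clinical": (5, 3),
    "synthetic clinical": (6, 3),
    "close proxy": (18, 16),
    "distant proxy": (17, 17),
}

# Totals over all five categories.
total_versions = sum(versions for versions, _ in counts.values())
total_unique = sum(unique for _, unique in counts.values())

# Subtotals for the clinical categories (real, translated, synthetic).
clinical_versions = sum(v for name, (v, _) in counts.items() if "clinical" in name)
clinical_unique = sum(u for name, (_, u) in counts.items() if "clinical" in name)

# Subtotals for the two proxy categories.
proxy_versions = total_versions - clinical_versions
proxy_unique = total_unique - clinical_unique

print(total_versions, total_unique)        # 92 71
print(clinical_versions, clinical_unique)  # 57 38
print(proxy_versions, proxy_unique)        # 35 33
```

The category counts add up to the reported 92 published versions and 71 unique corpora, and clinical corpora account for 57 of the 92 versions, which matches the statement that they form the majority.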

DISCUSSION

There is a clear divide between the large number of non-accessible real clinical German-language corpora and their publicly accessible substitutes: translated or synthetic datasets, and close or more distant proxies. So, at first sight, the data bottleneck seems broken. Yet, intuitively, differences in genre-specific writing style, lexical or terminological diction, and required medical background expertise across this typological space are also obvious. This raises the question of how valid alternative corpus designs really are. A systematic, empirically grounded yardstick for comparing real clinical corpora with the suggested substitutes and proxies is still missing.

CONCLUSION

The extreme sparsity of real clinical corpora in almost all non-Anglo-American countries worldwide, Germany in particular, has triggered an active search for alternative, publicly accessible data resources, which are laid out in this survey. However, the utility of these substitutes compared with real clinical corpora, and their semantic and genre-specific distance from real clinical corpora, are still under-researched, so their value remains to be assessed properly. Furthermore, corpus descriptions are often incomplete with respect to relevant descriptive attributes. This paper bundles these observations and proposes a template for a so-called corpus card to improve corpus documentation.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8d6d/12077144/bd415ce2038c/ooaf024f1.jpg
