Department of Biomedical Informatics & Medical Education, University of Washington, Seattle, WA, USA.
Khoury College of Computer Science, Northeastern University, Seattle, WA, USA.
Sci Data. 2022 Aug 11;9(1):490. doi: 10.1038/s41597-022-01521-0.
Identifying cohorts of patients based on eligibility criteria such as medical conditions, procedures, and medication use is critical to recruitment for clinical trials. Such criteria are often most naturally described in free text, using language familiar to clinicians and researchers. To identify potential participants at scale, these criteria must first be translated into queries on clinical databases, a process that can be labor-intensive and error-prone. Natural language processing (NLP) methods offer a potential means of automating this conversion into database queries. However, they must first be trained and evaluated using corpora that capture clinical trial criteria in sufficient detail. In this paper, we introduce the Leaf Clinical Trials (LCT) corpus, a human-annotated corpus of over 1,000 clinical trial eligibility criteria descriptions using highly granular structured labels capturing a range of biomedical phenomena. We provide details of our schema, annotation process, corpus quality, and statistics. Additionally, we present baseline information extraction results on this corpus as benchmarks for future work.