Semantic Computing Group, Cluster of Excellence Cognitive Interaction Technology (CITEC), Bielefeld University, Bielefeld, 33619, Germany.
J Biomed Semantics. 2022 May 23;13(1):14. doi: 10.1186/s13326-022-00271-7.
The evidence-based medicine paradigm requires the ability to aggregate and compare outcomes of interventions across different trials. This can be facilitated and partially automated by information extraction systems. To support the development of systems that can extract information from published clinical trials at a fine-grained and comprehensive level to populate a knowledge base, we present a corpus that is richly annotated at two levels. At the first level, entities that describe components of the PICO elements (e.g., the population's age and pre-conditions, the dosage of a treatment) are annotated. The second level comprises schema-level annotations (i.e., slot-filling templates) corresponding to complex PICO elements and other concepts related to a clinical trial (e.g., the relation between an intervention and an arm, or the relation between an outcome and an intervention).
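The two annotation levels described above can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names, the entity labels (`Drug`, `DoseValue`, `Frequency`), the slot names (`hasDrug`, `hasDoseValue`, `hasFrequency`, `hasIntervention`), and the example sentence are all hypothetical and are not taken from the corpus or from the C-TrO ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class EntityAnnotation:
    """Level 1: a labelled text span (character offsets, end exclusive)."""
    start: int
    end: int
    label: str   # hypothetical label, not the corpus's actual tag set
    text: str


@dataclass
class Template:
    """Level 2: a slot-filling template whose fillers are entities or other templates."""
    template_type: str
    slots: Dict[str, List[object]] = field(default_factory=dict)

    def add(self, slot: str, filler: object) -> None:
        # A slot may hold several fillers, so each slot maps to a list.
        self.slots.setdefault(slot, []).append(filler)


# Hypothetical sentence, invented for illustration only.
sentence = "Patients received latanoprost 0.005% once daily."


def span(label: str, surface: str) -> EntityAnnotation:
    """Locate `surface` in the sentence and wrap it as an entity annotation."""
    start = sentence.index(surface)
    return EntityAnnotation(start, start + len(surface), label, surface)


drug = span("Drug", "latanoprost")
dose = span("DoseValue", "0.005%")
freq = span("Frequency", "once daily")

# Entities fill the slots of an intervention template ...
intervention = Template("Intervention")
intervention.add("hasDrug", drug)
intervention.add("hasDoseValue", dose)
intervention.add("hasFrequency", freq)

# ... and templates can fill slots of other templates (intervention -> arm).
arm = Template("Arm")
arm.add("hasIntervention", intervention)
```

Keeping character offsets on the entity level and references between templates on the schema level mirrors the split between single entities and complex entities, although the released corpus may encode this differently.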
The final corpus includes 211 annotated clinical trial abstracts with substantial agreement between annotators at the entity and schema levels. For single entities, the mean Kappa values for the glaucoma and T2DM corpora were 0.74 and 0.68, respectively. The micro-averaged F score measuring inter-annotator agreement for complex entities (i.e., slot-filling templates) was 0.81. The BERT-base baseline method for entity recognition achieved average micro-F scores of 0.76 for glaucoma and 0.77 for diabetes with exact matching.
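The agreement and evaluation figures above rest on two standard measures, which can be sketched as follows. This is a minimal sketch under assumptions, not the code released with the corpus: it assumes Cohen's kappa computed over aligned label sequences from two annotators (the abstract does not say which Kappa variant was used), and it assumes that exact matching means an item counts as correct only if its start offset, end offset, and label all agree. The function names and toy data are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences (one per annotator)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of positions where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def micro_f1_exact(reference_docs, candidate_docs):
    """Micro-averaged F1 over documents; items are (start, end, label) tuples."""
    tp = fp = fn = 0
    for reference, candidate in zip(reference_docs, candidate_docs):
        reference, candidate = set(reference), set(candidate)
        tp += len(reference & candidate)   # identical span and label
        fp += len(candidate - reference)   # only in the candidate annotation
        fn += len(reference - candidate)   # only in the reference annotation
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
annotator_a = ["Drug", "O", "O", "Drug"]
annotator_b = ["Drug", "O", "Drug", "Drug"]
kappa = cohen_kappa(annotator_a, annotator_b)

reference = [{(0, 8, "Drug"), (9, 15, "Dose")}, {(3, 7, "Drug")}]
candidate = [{(0, 8, "Drug"), (9, 14, "Dose")}, {(3, 7, "Drug")}]
f1 = micro_f1_exact(reference, candidate)
```

Micro-averaging pools the counts over all documents before computing precision and recall, so frequent entity types weigh more than rare ones. When the same function is used for inter-annotator agreement, one annotator is simply treated as the reference; F1 is symmetric under swapping the two roles.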
In this work, we have created a corpus that goes beyond existing clinical trial corpora, since it is annotated in a schematic way that represents the classes and properties defined in an ontology. Although the corpus is small, it has fine-grained annotations and could be used to fine-tune pre-trained machine learning models and transformers for the specific task of extracting information from clinical trial abstracts. In future work, we will use the corpus to train information extraction systems that extract single entities and predict template slot-fillers (i.e., class data/object properties) to populate a knowledge base that relies on the C-TrO ontology for the description of clinical trials. The resulting corpus, the code to measure inter-annotator agreement, and the baseline method are publicly available at https://zenodo.org/record/6365890.