Department of Biomedical Informatics, Columbia University, New York, New York 10032, USA.
J Am Med Inform Assoc. 2011 Dec;18 Suppl 1(Suppl 1):i116-24. doi: 10.1136/amiajnl-2011-000321. Epub 2011 Jul 31.
To develop a semantic representation for clinical research eligibility criteria to automate semistructured information extraction from eligibility criteria text.
An analysis pipeline called eligibility criteria extraction and representation (EliXR) was developed that integrates syntactic parsing and tree pattern mining to discover common semantic patterns in 1000 eligibility criteria randomly selected from http://ClinicalTrials.gov. The semantic patterns were aggregated and enriched with Unified Medical Language System (UMLS) semantic knowledge to form a semantic representation for clinical research eligibility criteria.
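The idea of mining frequent patterns from parsed criteria can be illustrated with a toy sketch. This is not the EliXR implementation: it assumes each criterion has already been parsed into a nested `(label, children)` tuple, it counts only one-level parent-to-children patterns as a simplified stand-in for full tree pattern mining, and the node labels (`Criterion`, `Condition`, `Temporal`) and the helper names `productions` and `frequent_patterns` are hypothetical.

```python
from collections import Counter

def productions(tree):
    """Yield (parent_label, child_labels) for every internal node of a tree.

    A tree is a (label, children) pair; children is a list of trees.
    """
    label, children = tree
    if children:
        yield (label, tuple(child[0] for child in children))
        for child in children:
            yield from productions(child)

def frequent_patterns(trees, min_support):
    """Return one-level patterns that occur in at least min_support trees."""
    support = Counter()
    for tree in trees:
        # Use a set so a pattern counts once per tree (document frequency).
        support.update(set(productions(tree)))
    return {p: n for p, n in support.items() if n >= min_support}

# Hypothetical parsed criteria; labels are illustrative only.
toy_trees = [
    ("Criterion", [("Condition", []), ("Temporal", [])]),
    ("Criterion", [("Condition", []), ("Temporal", [])]),
    ("Criterion", [("Condition", [])]),
]
patterns = frequent_patterns(toy_trees, min_support=2)
```

A real tree miner would also consider deeper and partially matching subtrees; the one-level version only shows how support counting separates common patterns from rare ones.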
The authors arrived at 175 semantic patterns, which form 12 semantic role labels connected by their frequent semantic relations in a semantic network.
Three raters independently annotated all the sentence segments (N=396) for 79 test eligibility criteria using the 12 top-level semantic role labels. Eighty-six per cent (339) of the sentence segments were unanimously labelled correctly and 13.8% (55) were correctly labelled by two raters. The Fleiss' κ was 0.88, indicating nearly perfect interrater agreement.
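For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Fleiss' κ is computed from per-segment labels. It assumes every segment is rated by the same number of raters; the function name `fleiss_kappa` and the two placeholder labels `A` and `B` are illustrative and do not come from the study, whose raw annotations are not reproduced here.

```python
from collections import Counter

def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each a list of rater labels.

    Assumes every item has the same number of raters (at least 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing rater pairs per item.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(c[k] for c in counts) / (n_items * n_raters) for k in categories
    ]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:
        # Only one category was ever used; kappa is undefined, so this
        # sketch returns 1.0 by convention (an assumption, not a standard).
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

# Toy data: four segments, three raters, two placeholder labels.
toy = [["A", "A", "A"], ["A", "A", "B"], ["B", "B", "B"], ["B", "B", "A"]]
kappa = fleiss_kappa(toy, categories=["A", "B"])
```

In the toy data two of the four segments are labelled unanimously and the two categories are used equally often, so observed agreement is 2/3, chance agreement is 1/2, and κ is 1/3.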
This study presents a semi-automated data-driven approach to developing a semantic network that aligns well with the top-level information structure in clinical research eligibility criteria text and demonstrates the feasibility of using the resulting semantic role labels to generate semistructured eligibility criteria with nearly perfect interrater reliability.