Hao Tianyong, Liu Hongfang, Weng Chunhua
Chunhua Weng, Ph.D., Department of Biomedical Informatics, Columbia University, 622 W 168th Street, PH-20, New York, NY 10032, USA
Methods Inf Med. 2016 May 17;55(3):266-75. doi: 10.3414/ME15-01-0112. Epub 2016 Mar 4.
To develop an automated method for extracting and structuring numeric lab test comparison statements from text and evaluate the method using clinical trial eligibility criteria text.
Leveraging semantic knowledge from the Unified Medical Language System (UMLS) and domain knowledge acquired from the Internet, Valx takes seven steps to extract and normalize numeric lab test expressions: 1) text preprocessing, 2) numeric, unit, and comparison operator extraction, 3) variable identification using hybrid knowledge, 4) variable-numeric association, 5) context-based association filtering, 6) measurement unit normalization, and 7) heuristic rule-based verification of comparison statements. Our reference standard was the consensus-based annotation by three raters of all comparison statements for two variables, HbA1c and glucose, identified from all Type 1 and Type 2 diabetes trials in ClinicalTrials.gov.
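To make the kind of output concrete, the following is a minimal Python sketch of a few of the steps above (operator, number, and unit extraction; variable-numeric association; unit normalization). It is not the Valx implementation: the lexicons `VARIABLES`, `OPERATORS`, and `UNITS`, the regular expression, and the `extract` function are illustrative assumptions, and the glucose conversion factor is the approximate value 1 mmol/L ≈ 18.016 mg/dL.

```python
import re

# Hypothetical lexicons for illustration only (Valx derives its knowledge
# from the UMLS and Internet sources; these small dicts are assumptions).
VARIABLES = {"hba1c": "HbA1c", "a1c": "HbA1c",
             "glycosylated hemoglobin": "HbA1c", "glucose": "glucose"}
OPERATORS = {"<=": "<=", ">=": ">=", "≤": "<=", "≥": ">=", "<": "<", ">": ">",
             "=": "=", "less than": "<", "greater than": ">",
             "at least": ">=", "at most": "<="}
UNITS = {"%": "%", "mg/dl": "mg/dL", "mmol/l": "mmol/L"}
MMOL_TO_MGDL = 18.016  # approximate factor for glucose only


def _alternation(keys):
    # Longest alternatives first so that "<=" is tried before "<".
    return "|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True))


# variable, a short gap, comparison operator, number, optional unit
PATTERN = re.compile(
    rf"(?P<var>{_alternation(VARIABLES)})"
    r"[^<>=≤≥\d.;]{0,30}?"
    rf"(?P<op>{_alternation(OPERATORS)})\s*"
    r"(?P<num>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>%|mmol/l|mg/dl)?",
    re.IGNORECASE,
)


def extract(text):
    """Return (variable, operator, value, unit) tuples found in text."""
    results = []
    for m in PATTERN.finditer(text):
        variable = VARIABLES[m["var"].lower()]
        operator = OPERATORS[m["op"].lower()]
        value = float(m["num"])
        unit = UNITS.get((m["unit"] or "").lower(), "")
        # Normalize glucose values to a single unit (mg/dL).
        if variable == "glucose" and unit == "mmol/L":
            value, unit = round(value * MMOL_TO_MGDL, 1), "mg/dL"
        results.append((variable, operator, value, unit))
    return results


print(extract("HbA1c less than 10% and fasting glucose >= 7.0 mmol/L"))
```

A single regular expression like this skips the context-based filtering and rule-based verification steps entirely, so it would over- and under-match on real eligibility criteria; it only illustrates the structured form of a comparison statement.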
The precision, recall, and F-measure for structuring HbA1c comparison statements were 99.6%, 98.1%, 98.8% for Type 1 diabetes trials, and 98.8%, 96.9%, 97.8% for Type 2 diabetes trials, respectively. The precision, recall, and F-measure for structuring glucose comparison statements were 97.3%, 94.8%, 96.1% for Type 1 diabetes trials, and 92.3%, 92.3%, 92.3% for Type 2 diabetes trials, respectively.
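As a sanity check on how the three metrics relate, the sketch below computes the F-measure as the harmonic mean of precision and recall. This assumes the balanced F1 form, which the abstract does not state explicitly; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# HbA1c, Type 1 diabetes trials: precision 99.6%, recall 98.1%
print(round(100 * f_measure(0.996, 0.981), 1))  # → 98.8
```

Recomputing from the rounded percentages can differ from a published figure in the last digit, since the reported values were presumably derived from unrounded counts.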
Valx is effective at extracting and structuring free-text lab test comparison statements in clinical trial summaries. Future studies are warranted to test its generalizability beyond eligibility criteria text. Because Valx is open source, the scientific community can evaluate it further and continue to improve it.