Shivade Chaitanya, Hebert Courtney, Regan Kelly, Fosler-Lussier Eric, Lai Albert M
Department of Computer Science and Engineering.
Department of Biomedical Informatics, The Ohio State University, Columbus, OH.
AMIA Annu Symp Proc. 2017 Feb 10;2016:1149-1158. eCollection 2016.
Clinical trial coordinators refer to both structured and unstructured sources of data when evaluating a subject for eligibility. While some eligibility criteria can be resolved using structured data, others require manual review of clinical notes. An important step in automating the trial screening process is identifying the right data source for resolving each criterion. In this work, we discuss the creation of an eligibility criteria dataset for clinical trials for patients with two disparate diseases, in which annotators with medical training labeled each criterion with its preferred data source (i.e., structured or unstructured). The dataset includes 50 heart-failure trials with a total of 766 eligibility criteria and 50 trials for chronic lymphocytic leukemia (CLL) with 677 criteria. Further, we developed machine learning models to predict the preferred data source: kernel methods outperform simpler learning models when used with a combination of lexical, syntactic, semantic, and surface features. Evaluation of these models shows that performance is consistent across data from both diagnoses, suggesting that our method generalizes. Our findings are an important step in ongoing efforts to automate clinical trial screening.