IBM Watson Health, IBM Corporation, Cambridge, MA.
Department of Surgery, Vanderbilt University Medical Center, Nashville, TN.
JCO Clin Cancer Inform. 2021 Jan;5:102-111. doi: 10.1200/CCI.20.00087.
We developed a system to automate analysis of the clinical oncology scientific literature from bibliographic databases and match articles to specific patient cohorts to answer specific questions regarding the efficacy of a treatment. The approach attempts to replicate a clinician's mental processes when reviewing published literature in the context of a patient case. We describe the system and evaluate its performance.
We developed separate ground truth data sets for each of the tasks described in the paper. The first ground truth, used to measure natural language processing (NLP) accuracy, was drawn from approximately 1,300 papers and covered approximately 3,100 statements and approximately 25 concepts; performance was evaluated using a standard F1 score. The ground truth for the expert classifier model was generated by dividing papers cited in clinical guidelines into a training set and a test set in an 80:20 ratio, and performance was evaluated for accuracy, sensitivity, and specificity.
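The evaluation measures named above can be illustrated with a minimal Python sketch. It assumes binary labels (1 = positive, 0 = negative) and a simple shuffled 80:20 split. The function names, the random seed, and the shuffling strategy are illustrative assumptions, not details reported by the authors.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle a copy of items and split it 80:20 (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def evaluate(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Sensitivity (recall): share of actual positives that were found.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of actual negatives that were rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # F1: harmonic mean of precision and recall.
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


if __name__ == "__main__":
    train, test = split_80_20(range(10))
    print(len(train), len(test))
    print(evaluate([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0]))
```

Because F1 ignores true negatives, it is usually reported alongside specificity when the negative class matters, as it does here.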
The NLP models identified individual attributes with F1 scores of 0.7-0.9, depending on the attribute of interest. The expert classifier machine learning model classified individual records with an accuracy of 0.93 (95% CI, 0.9 to 0.96; P < .0001), and sensitivity and specificity of 0.95 and 0.91, respectively. Using a decision boundary of 0.5 for the positive (expert) label, the classifier demonstrated an F1 score of 0.92.
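As a rough illustration of what a decision boundary of 0.5 means, the sketch below turns predicted probabilities for the positive (expert) label into binary predictions and scores them with F1. The probabilities are made-up values, and treating a score exactly at the boundary as positive is an assumption; the abstract does not specify either.

```python
def apply_decision_boundary(probabilities, boundary=0.5):
    """Label a record positive (1) when its probability meets the boundary.

    Using >= at the boundary is an assumption made for this sketch.
    """
    return [1 if p >= boundary else 0 for p in probabilities]


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # Illustrative probabilities only; not data from the study.
    probs = [0.9, 0.4, 0.5, 0.2]
    truth = [1, 1, 1, 0]
    preds = apply_decision_boundary(probs)
    print(preds, f1_score(truth, preds))
```

Raising the boundary generally trades sensitivity for specificity, so an F1 score is only meaningful together with the boundary at which it was computed.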
The system identified and extracted evidence from the oncology literature with a high degree of accuracy, sensitivity, and specificity. This tool enables timely access to the most relevant biomedical literature, providing critical support to evidence-based practice in areas of rapidly evolving science.