Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA.
Department of Community Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, Minnesota, USA.
J Am Med Inform Assoc. 2014 Sep-Oct;21(5):876-84. doi: 10.1136/amiajnl-2013-002463. Epub 2014 May 15.
This study specifies the problem of patient-level temporal aggregation from clinical text and introduces several probabilistic methods for addressing it. The patient-level perspective differs from the prevailing natural language processing (NLP) practice of evaluating at the term, event, sentence, document, or visit level.
We used an existing pediatric asthma cohort with manual annotations. After generating a basic feature set via standard clinical NLP methods, we introduced six methods of aggregating time-distributed features from the document level to the patient level. These aggregation methods were used to classify patients according to their asthma status in two hypothetical settings: retrospective epidemiology and clinical decision support.
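As an illustration of what document-to-patient aggregation can look like, the following is a minimal sketch of a sum-style aggregation. It is not the authors' implementation: the function name `sum_aggregate`, the tuple layout of each document, the feature names, and the optional `as_of` cutoff (used here to mimic a point-in-time view such as a decision support setting) are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def sum_aggregate(documents, as_of=None):
    """Sum document-level feature counts into one feature dict per patient.

    documents: iterable of (patient_id, doc_date, {feature_name: count}).
    as_of: optional cutoff date (hypothetical parameter); documents dated
           after it are ignored.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for patient_id, doc_date, features in documents:
        if as_of is not None and doc_date > as_of:
            continue  # skip evidence that was not yet available at the cutoff
        for name, count in features.items():
            totals[patient_id][name] += count
    # Convert nested defaultdicts to plain dicts for easier downstream use.
    return {pid: dict(feats) for pid, feats in totals.items()}


# Toy data, invented for illustration only.
docs = [
    ("p1", date(2010, 3, 1), {"wheeze": 2, "cough": 1}),
    ("p1", date(2011, 6, 15), {"wheeze": 1}),
    ("p2", date(2010, 9, 9), {"cough": 3}),
]

patient_features = sum_aggregate(docs)
early_features = sum_aggregate(docs, as_of=date(2010, 12, 31))
```

The resulting per-patient dictionaries could then be passed to any standard classifier. Summation discards the timing of each mention, which is one reason a time-aware method might behave differently.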
In both settings, machine learning algorithms achieved solid patient classification performance with a number of evidence aggregation methods. Sum aggregation obtained the highest F1 score (85.71%) in the retrospective epidemiological setting, and a probability density function-based method obtained the highest F1 score (74.63%) in the clinical decision support setting. Multiple techniques also estimated the diagnosis date (index date) of asthma with promising accuracy.
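For readers less familiar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from patient-level confusion counts. The function name `f1_score` and the counts are illustrative assumptions and are not taken from the study.

```python
def f1_score(tp, fp, fn):
    """Return F1, the harmonic mean of precision and recall.

    tp, fp, fn: counts of true positives, false positives, and false
    negatives (here, patients correctly or incorrectly labeled).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: precision and recall are both 0.8, so F1 is about 0.8.
example = f1_score(tp=8, fp=2, fn=2)
```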
The clinical decision support setting is a more difficult problem. We rule out some aggregation methods rather than determining the best overall aggregation method, since our preliminary data set represented a practical setting in which manually annotated data were limited.
Results contrasted the strengths of several aggregation algorithms in different settings. Multiple approaches exhibited good patient classification performance, and also predicted the timing of estimates with reasonable accuracy.