Institute of Psychiatry, Psychology and Neuroscience, King's College London, De Crespigny Park, London, SE5 8AF, UK.
South London and Maudsley NHS Foundation Trust, London, UK.
J Biomed Semantics. 2020 Mar 10;11(1):2. doi: 10.1186/s13326-020-00220-2.
Duration of untreated psychosis (DUP) is an important clinical construct in the field of mental health, as longer DUP can be associated with worse intervention outcomes. DUP estimation requires knowledge about when psychosis symptoms first started (symptom onset), and when psychosis treatment was initiated. Electronic health records (EHRs) represent a useful resource for retrospective clinical studies on DUP, but the core information underlying this construct is most likely to lie in free text, meaning that it is not readily available for clinical research. Natural Language Processing (NLP) is a means of addressing this problem by automatically extracting relevant information in a structured form. As a first step, it is important to identify appropriate documents, i.e., those that are likely to include the information of interest. Next, temporal information extraction methods are needed to identify time references for early psychosis symptoms. This NLP challenge requires solving three different tasks: time expression extraction, symptom extraction, and temporal "linking". In this study, we focus on the first step, using two relevant EHR datasets.
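To make the DUP construct concrete, the following is a minimal sketch of the arithmetic once both dates are known: the interval between symptom onset and treatment initiation. The function name `dup_in_days` and the example dates are hypothetical and are not taken from the study; in practice, onset dates found in free text are often only known to month or year precision, which this sketch does not handle.

```python
from datetime import date


def dup_in_days(symptom_onset: date, treatment_start: date) -> int:
    """Return the number of days between symptom onset and treatment start.

    Hypothetical helper for illustration only; assumes both dates are
    known to day precision.
    """
    if treatment_start < symptom_onset:
        raise ValueError("treatment_start precedes symptom_onset")
    return (treatment_start - symptom_onset).days


# Illustrative dates only: onset on 1 May 2011, treatment on 1 November 2011.
example_dup = dup_in_days(date(2011, 5, 1), date(2011, 11, 1))
```

The hard part addressed by the study is not this subtraction but recovering the two dates from clinical free text.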
We applied a rule-based NLP system for time expression extraction that we had previously adapted to a corpus of mental health EHRs from patients with a diagnosis of schizophrenia (first referrals). We extended this work by applying this NLP system to a larger set of documents and patients, to identify additional texts that would be relevant for our long-term goal, and developed a new corpus from a subset of these new texts (early intervention services). Furthermore, we added normalized value annotations ("2011-05") to the annotated time expressions ("May 2011") in both corpora. The finalized corpora were used for further NLP development and evaluation, with promising results (normalization accuracy 71-86%). To highlight the specificities of our annotation task, we also applied the final adapted NLP system to a different temporally annotated clinical corpus.
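As an illustration of what a normalized value annotation looks like, the sketch below maps month-year expressions such as "May 2011" to values such as "2011-05". It is a deliberately simple regular-expression rule, not the rule-based system used in the study; the names `MONTHS`, `PATTERN`, and `normalize_month_year` are hypothetical, and only the "Month YYYY" pattern is covered.

```python
import re

# Full English month names mapped to their month number (1-12).
MONTHS = {
    name: number
    for number, name in enumerate(
        [
            "january", "february", "march", "april", "may", "june",
            "july", "august", "september", "october", "november", "december",
        ],
        start=1,
    )
}

# Matches a month name followed by a four-digit year, e.g. "May 2011".
PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)


def normalize_month_year(text):
    """Return (expression, normalized value) pairs found in the text.

    Hypothetical helper: handles only "Month YYYY" expressions.
    """
    results = []
    for match in PATTERN.finditer(text):
        month = MONTHS[match.group(1).lower()]
        year = match.group(2)
        results.append((match.group(0), f"{year}-{month:02d}"))
    return results


pairs = normalize_month_year("Symptoms first reported in May 2011.")
```

Relative or vague expressions common in clinical notes (for example "two years ago" or "since childhood") need a reference date and further rules, which is part of why normalization accuracy is reported separately from extraction.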
Developing domain-specific methods is crucial to address complex NLP tasks such as symptom onset extraction and retrospective calculation of duration of a preclinical syndrome. To the best of our knowledge, this is the first clinical text resource annotated for temporal entities in the mental health domain.