Caufield John Harry, Liem David A, Garlid Anders O, Zhou Yijiang, Watson Karol, Bui Alex A T, Wang Wei, Ping Peipei
The NIH BD2K Center of Excellence in Biomedical Computing, University of California, Los Angeles; Department of Physiology, University of California, Los Angeles; Department of Medicine/Cardiology, University of California, Los Angeles.
J Vis Exp. 2018 Sep 20;(139):58392. doi: 10.3791/58392.
Clinical case reports (CCRs) are a valuable means of sharing observations and insights in medicine. The form of these documents varies, and their content includes descriptions of numerous novel disease presentations and treatments. The text data within CCRs remain largely unstructured, so significant human and computational effort is required to render them useful for in-depth analysis. In this protocol, we describe methods for identifying metadata corresponding to specific biomedical concepts frequently observed within CCRs. We provide a metadata template as a guide for document annotation, recognizing that structure may be imposed on CCRs through a combination of manual and automated effort. The approach presented here is appropriate for organizing concept-related text from a large literature corpus (e.g., thousands of CCRs) but may be easily adapted to more focused tasks or small sets of reports. The resulting structured text data include sufficient semantic context to support a variety of subsequent text analysis workflows: meta-analyses to determine how to maximize CCR detail, epidemiological studies of rare diseases, and the development of models of medical language may all become more feasible and manageable with structured text data.
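As a rough illustration of what structured metadata for a single CCR might look like, the following Python sketch defines a small record type and a helper that lists fields still awaiting annotation. The class name, field names, and placeholder values are all hypothetical; they are assumptions made for illustration and do not reproduce the metadata template provided with the protocol.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class CaseReportMetadata:
    """Hypothetical structured record for one clinical case report (CCR).

    Every field name here is an illustrative assumption, not the
    published metadata template.
    """
    identifier: str                                   # e.g., a PMID or DOI string
    title: str
    disease_terms: List[str] = field(default_factory=list)
    treatments: List[str] = field(default_factory=list)
    patient_age_years: Optional[int] = None
    patient_sex: Optional[str] = None


def missing_fields(record: CaseReportMetadata) -> List[str]:
    """Return the names of fields left empty, to flag reports needing annotation."""
    return [
        name
        for name, value in asdict(record).items()
        if value in (None, "", [])
    ]


# Placeholder values only; none of this is real report data.
record = CaseReportMetadata(
    identifier="example-0001",
    title="Placeholder case report title",
    disease_terms=["placeholder disease term"],
)
print(missing_fields(record))
```

A check of this kind could support either manual or automated annotation, since it only inspects which fields are populated and not how they were filled.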