Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Cambridge, Massachusetts, USA.
J Am Med Inform Assoc. 2014 Sep-Oct;21(5):824-32. doi: 10.1136/amiajnl-2013-002443. Epub 2014 Jan 15.
Pathology reports are rich in narrative statements that encode a complex web of relations among medical concepts. Doctors routinely use these relations to reason about diagnoses, but extracting them into prespecified forms for computational disease modeling often requires hand-crafted rules or supervised learning. We aim to automatically capture relations from narrative text without supervision.
We design a novel framework that translates sentences into graph representations, automatically mines sentence subgraphs, reduces redundancy in the mined subgraphs, and automatically generates subgraph features for subsequent classification tasks. To ensure meaningful interpretations over the sentence graphs, we use the Unified Medical Language System Metathesaurus to map token subsequences to concepts, and in turn to sentence graph nodes. We test our system with multiple lymphoma classification tasks that together mimic the differential diagnosis performed by a pathologist. To this end, we prevent our classifiers from looking at explicit mentions or synonyms of lymphomas in the text.
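The pipeline described above (map token subsequences to concepts, build sentence graphs, mine recurring subgraphs, emit subgraph features) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the `CONCEPTS` dictionary is a hypothetical stand-in for a UMLS Metathesaurus lookup, edges simply link adjacent concepts rather than following a parser, mined "subgraphs" are limited to single edges, and redundancy reduction is omitted.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a UMLS Metathesaurus lookup.
CONCEPTS = {
    ("large", "cells"): "C_LARGE_CELL",
    ("cd20",): "C_CD20",
    ("positive",): "C_POSITIVE",
    ("express",): "C_EXPRESS",
}

def map_concepts(tokens):
    """Greedy longest-match mapping of token subsequences to concept nodes."""
    nodes, i = [], 0
    max_len = max(len(key) for key in CONCEPTS)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in CONCEPTS:
                nodes.append(CONCEPTS[key])
                i += n
                break
        else:
            i += 1  # no concept starts at this token; skip it
    return nodes

def sentence_graph(tokens):
    """Return the edge set of a toy sentence graph.

    Assumption: adjacent concepts are linked; a real system would
    derive edges from a syntactic parse instead.
    """
    nodes = map_concepts(tokens)
    return {(a, b) for a, b in zip(nodes, nodes[1:])}

def frequent_edges(graphs, min_support=2):
    """Keep single-edge subgraphs that occur in at least min_support graphs."""
    counts = Counter(edge for graph in graphs for edge in graph)
    return sorted(edge for edge, c in counts.items() if c >= min_support)

def featurize(graph, vocab):
    """Binary feature vector: 1 if the mined subgraph occurs in the sentence."""
    return [1 if edge in graph else 0 for edge in vocab]

sentences = [
    "The large cells express CD20".split(),
    "Large cells express CD20 strongly".split(),
    "CD20 positive".split(),
]
graphs = [sentence_graph(s) for s in sentences]
vocab = frequent_edges(graphs)
features = [featurize(g, vocab) for g in graphs]
```

The resulting `features` rows could be passed to any standard classifier; mining larger connected subgraphs would require a dedicated frequent-subgraph miner rather than the edge counter used here.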
We compare our system with three baseline classifiers that use standard n-grams, full MetaMap concepts, and filtered MetaMap concepts. Our system achieves high F-measures on multiple binary classifications of lymphoma (Burkitt lymphoma, 0.8; diffuse large B-cell lymphoma, 0.909; follicular lymphoma, 0.84; Hodgkin lymphoma, 0.912). Significance tests show that our system outperforms all three baselines. Moreover, feature analysis identifies subgraph features that contribute to the improved performance; these features agree with current knowledge about lymphoma classification. We also highlight how these unsupervised relation features may provide meaningful insights into lymphoma classification.
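For readers unfamiliar with the metric reported above, the F-measure of a binary classifier is commonly computed as the harmonic mean of precision and recall. The sketch below assumes the balanced F1 variant, which the abstract does not state explicitly, and the counts in the usage line are hypothetical, not taken from the study.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1) from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one binary task (illustration only).
score = f_measure(tp=9, fp=1, fn=3)
```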