Doing-Harris Kristina, Patterson Olga, Igo Sean, Hurdle John
Department of Biomedical Informatics, University of Utah, Health Sciences Center, Salt Lake City, UT.
VA SLC Health Care, Salt Lake City, UT.
Proc ACM Int Workshop Data Text Min Biomed Inform. 2013 Oct-Nov;2013:9-12. doi: 10.1145/2512089.2512101.
This paper reports on a set of studies designed to identify sublanguages in documents so that domain-specific processing can be applied across institutions. Psychological evidence indicates that humans use context-specific linguistic information when they read. Likewise, Natural Language Processing (NLP) pipelines are most successful within specific domains (i.e., contexts). To limit the number of domain-specific NLP systems required, a natural focus is on sublanguages, which are identified by shared lexical and semantic features.[1] Patterson and Hurdle[2] developed a sublanguage identification system that functioned well for 12 clinical specialties at the University of Utah. The current work compares sublanguages across institutions. A clinical NLP pipeline was augmented with a new document corpus from the University of Pittsburgh (UPitt), and each new document was assigned to the cluster whose Utah centroid was at the minimum cosine distance. The UPitt documents were divided into a nine-group specialty corpus. Across institutions, five of the nine specialty groups fell within the expected clusters. Clustering encountered difficulty for three reasons: documents with mixed sublanguages, differences in naming conventions across institutions, and document types shared across specialties. The findings indicate that clinical specialty sublanguages can be identified across institutions.
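The assignment step described in the abstract, placing a new document in the cluster whose centroid is at the minimum cosine distance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three-dimensional feature vectors, and the specialty labels below are hypothetical, and the actual system derives its features from lexical and semantic information that the abstract does not specify.

```python
import math
from typing import Dict, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def assign_to_nearest_centroid(
    doc_vector: Sequence[float],
    centroids: Dict[str, Sequence[float]],
) -> str:
    """Return the name of the centroid closest to doc_vector."""
    return min(
        centroids,
        key=lambda name: cosine_distance(doc_vector, centroids[name]),
    )


# Hypothetical centroids standing in for clusters learned at one institution.
centroids = {
    "cardiology": [0.9, 0.1, 0.0],
    "radiology": [0.1, 0.8, 0.3],
}

# Hypothetical feature vector for a document from a second institution.
new_doc = [0.7, 0.2, 0.1]
assigned = assign_to_nearest_centroid(new_doc, centroids)
print(assigned)
```

Because cosine distance ignores vector magnitude, a short note and a long note with the same mix of features are assigned to the same cluster. A document with mixed sublanguages, by contrast, can sit nearly equidistant from several centroids, which is one of the difficulties the abstract reports.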