Using natural language processing to enable in-depth analysis of clinical messages posted to an Internet mailing list: a feasibility study.

Authors

Bekhuis Tanja, Kreinacke Marcos, Spallek Heiko, Song Mei, O'Donnell Jean A

Affiliation

Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15232, United States. tcb24 [at] pitt.edu

Publication information

J Med Internet Res. 2011 Nov 23;13(4):e98. doi: 10.2196/jmir.1799.

Abstract

BACKGROUND

An Internet mailing list may be characterized as a virtual community of practice that serves as an information hub with easy access to expert advice and opportunities for social networking. We are interested in mining messages posted to a list for dental practitioners to identify clinical topics. Once we understand the topical domain, we can study dentists' real information needs and the nature of their shared expertise, and can avoid delivering useless content at the point of care in future informatics applications. However, a necessary first step involves developing procedures to identify messages that are worth studying given our resources for planned, labor-intensive research.

OBJECTIVES

The primary objective of this study was to develop a workflow for finding a manageable number of clinically relevant messages from a much larger corpus of messages posted to an Internet mailing list, and to demonstrate the potential usefulness of our procedures for investigators by retrieving a set of messages tailored to the research question of a qualitative research team.

METHODS

We mined 14,576 messages posted to an Internet mailing list from April 2008 to May 2009. The list has about 450 subscribers, mostly dentists from North America interested in clinical practice. After extensive preprocessing, we used the Natural Language Toolkit to identify clinical phrases and keywords in the messages. Two academic dentists classified collocated phrases in an iterative, consensus-based process to describe the topics discussed by dental practitioners who subscribe to the list. We then consulted with qualitative researchers regarding their research question to develop a plan for targeted retrieval. We used selected phrases and keywords as search strings to identify clinically relevant messages and delivered the messages in a reusable database.
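
The retrieval step described above (search strings applied to messages, with hits stored in a reusable database) can be illustrated with a minimal sketch. It is not the authors' implementation: the category names, search strings, sample messages, table layout, and function names below are all hypothetical, and SQLite is assumed only because it ships with Python.

```python
import re
import sqlite3

# Hypothetical mapping from category to search strings (phrases or keywords).
SEARCH_STRINGS = {
    "systemic disease": ["herpes zoster", "diabetes"],
    "materials": ["mercury"],
}

def find_matches(body, search_strings):
    """Return (category, phrase) pairs whose phrase occurs in body as whole words."""
    hits = []
    for category, phrases in search_strings.items():
        for phrase in phrases:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if re.search(pattern, body, flags=re.IGNORECASE):
                hits.append((category, phrase))
    return hits

def build_database(messages, search_strings, path=":memory:"):
    """Store only messages with at least one hit, plus the hits themselves."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("CREATE TABLE hit (message_id INTEGER, category TEXT, phrase TEXT)")
    for msg_id, body in messages:
        hits = find_matches(body, search_strings)
        if not hits:
            continue  # message is not clinically relevant under these search strings
        conn.execute("INSERT INTO message VALUES (?, ?)", (msg_id, body))
        conn.executemany(
            "INSERT INTO hit VALUES (?, ?, ?)",
            [(msg_id, category, phrase) for category, phrase in hits],
        )
    conn.commit()
    return conn

# Invented sample messages, for illustration only.
messages = [
    (1, "Patient with herpes zoster, any advice?"),
    (2, "Looking for a good accountant."),
    (3, "Concerns about mercury exposure in a diabetic patient."),
]
conn = build_database(messages, SEARCH_STRINGS)
relevant_ids = [row[0] for row in conn.execute("SELECT id FROM message ORDER BY id")]
```

Exact whole-word matching misses inflected forms: in the sample, "diabetic" does not match the search string "diabetes". That is one reason human review of the discovered phrases matters in a workflow like this.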

RESULTS

About half of the subscribers (245/450, 54.4%) posted messages. Natural language processing (NLP) yielded 279,193 clinically relevant tokens or processed words (19% of all tokens). Of these, 2.02% (5634 unique tokens) represent the vocabulary for dental practitioners. Based on pointwise mutual information score and clinical relevance, 325 collocated phrases (eg, fistula filled obturation and herpes zoster) with 108 keywords (eg, mercury) were classified into 13 broad categories with subcategories. In the demonstration, we identified 305 relevant messages (2.1% of all messages) over 10 selected categories with instances of collocated phrases, and 299 messages (2.1%) with instances of phrases or keywords for the category systemic disease.
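
Pointwise mutual information compares how often two words occur together with how often they would co-occur by chance: PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))). The abstract does not give the scoring configuration, so the following is a minimal pure-Python sketch under stated assumptions rather than the authors' Natural Language Toolkit pipeline. The sample messages, the regex tokenizer, the `min_count` threshold, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a stand-in for real preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())

def bigram_pmi(messages, min_count=2):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times."""
    unigrams = Counter()
    bigrams = Counter()
    for msg in messages:
        tokens = tokenize(msg)
        unigrams.update(tokens)
        # Pairs are formed within a message, never across message boundaries.
        bigrams.update(zip(tokens, tokens[1:]))
    n = sum(unigrams.values())  # total token count
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue  # rare pairs get inflated PMI scores, so filter them out
        # Estimate P(w1, w2) as c12 / n and P(w) as count(w) / n, which gives
        # PMI = log2(c12 * n / (count(w1) * count(w2))).
        scores[(w1, w2)] = math.log2(c12 * n / (unigrams[w1] * unigrams[w2]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented sample messages, for illustration only.
sample = [
    "Patient presented with herpes zoster on the left side.",
    "Any advice on herpes zoster and the timing of treatment?",
    "The patient asked about mercury in the filling.",
]
ranked = bigram_pmi(sample)
```

In this toy corpus only "herpes zoster" recurs as an adjacent pair, so it is the sole candidate that survives the frequency filter. A high PMI score alone does not establish clinical relevance, which is why the study pairs the score with classification by dentists.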

CONCLUSIONS

A workflow with a sequence of machine-based steps and human classification of NLP-discovered phrases can support researchers who need to identify relevant messages in a much larger corpus. Discovered phrases and keywords are useful search strings to aid targeted retrieval. We demonstrate the potential value of our procedures for qualitative researchers by retrieving a manageable set of messages concerning systemic and oral disease.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/60c4/3236668/7a46b1771fc7/jmir_v13i4e98_fig1.jpg
