Islamaj Rezarta, Yeganova Lana, Kim Won, Xie Natalie, Wilbur W John, Lu Zhiyong
National Library of Medicine, National Institutes of Health, Bethesda MD, USA.
AMIA Jt Summits Transl Sci Proc. 2020 May 30;2020:259-268. eCollection 2020.
As the volume of information keeps growing, organizing large document collections in a way that facilitates human comprehension becomes crucial. In this work, we present PDC (probabilistic distributional clustering), a novel algorithm that, given a document collection, computes disjoint term sets representing topics in the collection. The algorithm relies on probabilities of word co-occurrences to partition the set of terms appearing in the collection into disjoint groups of related terms. We also present an environment for visualizing the computed topics in the term space and retrieving the PubMed articles most related to each group of terms. We illustrate the algorithm by applying it to PubMed documents on the topic of suicide, a major public health problem identified as the tenth leading cause of death in the US. In this application, our goal is to provide a global view of the mental health literature pertaining to suicide and thereby to help create a rich environment of multifaceted data that guides health care researchers in their effort to better understand the breadth, depth, and scope of the problem. We demonstrate the usefulness of the proposed algorithm by providing a web portal that allows mental health researchers to peruse the suicide-related literature in PubMed.
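The abstract does not give the details of PDC, so the following is only a minimal sketch of the general idea of partitioning terms into disjoint groups from co-occurrence probabilities. It is not the published algorithm: pointwise mutual information (PMI) and connected components are stand-in techniques chosen for illustration, and the function name `cooccurrence_groups`, the `min_pmi` threshold, and the toy documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_groups(docs, min_pmi=0.5):
    """Partition terms into disjoint groups by linking term pairs with high PMI.

    Illustrative stand-in only; this is NOT the PDC algorithm from the paper.
    `docs` is a list of token lists; `min_pmi` is a hypothetical threshold.
    """
    n_docs = len(docs)
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing both terms of a pair
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))

    # Union-find over terms: linked terms end up in the same group.
    parent = {t: t for t in term_df}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), n_ab in pair_df.items():
        p_ab = n_ab / n_docs
        p_a = term_df[a] / n_docs
        p_b = term_df[b] / n_docs
        pmi = math.log(p_ab / (p_a * p_b))
        if pmi >= min_pmi:
            parent[find(a)] = find(b)

    # Collect the connected components as sorted, disjoint term lists.
    groups = {}
    for t in term_df:
        groups.setdefault(find(t), set()).add(t)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy documents (made up for illustration, not data from the paper).
docs = [
    ["suicide", "prevention", "program"],
    ["prevention", "program", "school"],
    ["lithium", "bipolar"],
    ["lithium", "bipolar", "treatment"],
]
groups = cooccurrence_groups(docs)
print(groups)
```

Every term lands in exactly one group, which mirrors the disjointness property described above. A real system would need a principled probabilistic criterion rather than a fixed threshold.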