Biomedical Knowledge Engineering Laboratory, BK21 College of Dentistry, Seoul National University, 28 Yeongeon-dong, Jongro-gu, Seoul 110-749, Republic of Korea.
J Biomed Inform. 2010 Feb;43(1):31-40. doi: 10.1016/j.jbi.2009.07.006. Epub 2009 Jul 25.
Concurrent with progress in the biomedical sciences, an overwhelming amount of textual knowledge is accumulating in the biomedical literature. PubMed is the most comprehensive database for collecting and managing biomedical literature. To help researchers easily understand collections of PubMed abstracts, numerous clustering methods have been proposed that group similar abstracts based on their shared features. However, most of these methods do not explore the semantic relationships among the resulting groups of documents, which could help to better illuminate the groupings of PubMed abstracts. To address this issue, we propose an ontological clustering method called GOClonto for conceptualizing PubMed abstracts. GOClonto uses latent semantic analysis (LSA) and the Gene Ontology (GO) to identify key gene-related concepts and their relationships, and to allocate PubMed abstracts based on these key gene-related concepts. On two PubMed abstract collections, the experimental results show that GOClonto is able to identify key gene-related concepts and outperforms the STC (suffix tree clustering) algorithm, the Lingo algorithm, the Fuzzy Ants algorithm, and the TRS (tolerance rough set) based clustering algorithm. Moreover, the two ontologies generated by GOClonto show significantly informative conceptual structures.
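The abstract does not give GOClonto's actual procedure, so the following is only a minimal sketch of the general idea it names: project abstracts and candidate concept labels into an LSA space, then allocate each abstract to its closest label. The toy abstracts, the short `go_terms` list, the function names, and the use of raw term counts with cosine similarity are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

# Toy stand-ins for PubMed abstracts (invented for illustration).
abstracts = [
    "apoptosis regulates cell death in tumor cells",
    "cell death and apoptosis signaling pathway",
    "dna repair after dna damage",
    "dna damage triggers dna repair response",
]

# Hypothetical candidate concept labels; the choice of terms is an assumption.
go_terms = ["apoptosis", "dna repair"]


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def build_term_document_matrix(docs):
    """Return a (terms x documents) raw-count matrix and the term index."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in tokenize(doc):
            matrix[index[tok], j] += 1.0
    return matrix, index


def normalize_rows(m):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)


def lsa_assign(docs, labels, k=2):
    """Assign each document to the label closest to it in a rank-k LSA space."""
    a, index = build_term_document_matrix(docs)
    # Truncated SVD: keep the k left singular vectors with the largest
    # singular values as the latent "concept" axes.
    u, _, _ = np.linalg.svd(a, full_matrices=False)
    u_k = u[:, :k]
    # Fold documents and labels into the same latent space.
    doc_vecs = (u_k.T @ a).T
    label_vecs = []
    for label in labels:
        q = np.zeros(a.shape[0])
        for tok in tokenize(label):
            if tok in index:
                q[index[tok]] += 1.0
        label_vecs.append(u_k.T @ q)
    # Cosine similarity between every document and every label.
    sims = normalize_rows(doc_vecs) @ normalize_rows(np.array(label_vecs)).T
    return [labels[i] for i in sims.argmax(axis=1)]


if __name__ == "__main__":
    for text, label in zip(abstracts, lsa_assign(abstracts, go_terms)):
        print(f"{label!r} <- {text!r}")
```

In this sketch the labels are matched only by the words in their names. Using the relationships between GO terms to arrange the selected concepts into an ontology, as the abstract describes, is outside its scope.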