Vanteru Bhanu C, Shaik Jahangheer S, Yeasin Mohammed
Electrical and Computer Engineering Department, University of Memphis, Memphis, Tennessee, USA.
BMC Genomics. 2008;9 Suppl 1(Suppl 1):S10. doi: 10.1186/1471-2164-9-S1-S10.
Technological advances over the past decade have led to massive progress in the field of biotechnology. This progress is documented in the form of research articles. PubMed is currently the most widely used repository for bio-literature. As of 2007, PubMed held about 17 million abstracts, which calls for methods that can efficiently retrieve and browse large volumes of relevant information. State-of-the-art technologies such as GOPubmed use simple keyword-based techniques to retrieve abstracts from PubMed and link them to the Gene Ontology (GO). This paper changes the paradigm by introducing a semantics-enabled technique, called SEGOPubmed, that links PubMed to the Gene Ontology for ontology-based browsing. A Latent Semantic Analysis (LSA) framework is used to semantically interface PubMed abstracts with the Gene Ontology.
An empirical analysis is performed to compare the performance of SEGOPubmed with that of GOPubmed. The analysis is first performed using a few well-referenced query words. A statistical analysis is then performed using a GO-curated dataset as ground truth. The analysis suggests that SEGOPubmed performs better than the classic GOPubmed because it incorporates semantics.
The LSA technique is applied to the PubMed abstracts retrieved for the user query and to the semantic similarity between the query and those abstracts. The analyses using well-referenced keywords show that the proposed semantics-sensitive technique outperforms techniques based on string comparison in associating relevant abstracts with GO terms. SEGOPubmed also extracted abstracts in which the keywords do not appear in isolation (i.e., they appear in combination with other terms), which simple term-matching techniques could not retrieve.
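To make the idea of scoring abstracts by semantic similarity in an LSA space concrete, the following is a minimal sketch and not the SEGOPubmed implementation. It assumes raw term counts, whitespace tokenization, a truncated SVD, and cosine similarity; the function names (`build_term_document_matrix`, `lsa_fit`, `fold_in`, `lsa_similarities`) and the four toy "abstracts" are hypothetical and chosen only for illustration.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Return a terms-by-documents count matrix and a term -> row index map."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            X[index[w], j] += 1.0
    return X, index


def lsa_fit(X, k):
    """Truncated SVD of X, keeping at most k non-zero singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(k, int(np.sum(s > 1e-12)))
    return U[:, :k], s[:k], Vt[:k, :]


def fold_in(text, index, U_k, s_k):
    """Project a query into the latent space: q_hat = q^T U_k S_k^{-1}."""
    q = np.zeros(U_k.shape[0])
    for w in text.lower().split():
        if w in index:  # words outside the vocabulary are ignored
            q[index[w]] += 1.0
    return (q @ U_k) / s_k


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return float(a @ b / (na * nb))


def lsa_similarities(query, docs, k=2):
    """Cosine similarity between the query and every document in LSA space."""
    X, index = build_term_document_matrix(docs)
    U_k, s_k, Vt_k = lsa_fit(X, k)
    q_hat = fold_in(query, index, U_k, s_k)
    # Column j of Vt_k is the latent representation of document j.
    return np.array([cosine(q_hat, Vt_k[:, j]) for j in range(len(docs))])


if __name__ == "__main__":
    # Toy stand-ins for abstracts (illustrative only, not PubMed data).
    toy_docs = [
        "apoptosis programmed cell death",
        "cell death caspase activation",
        "dna repair damage response",
        "dna damage repair pathway",
    ]
    for doc, sim in zip(toy_docs, lsa_similarities("apoptosis", toy_docs)):
        print(f"{sim:+.3f}  {doc}")
```

In this toy corpus the second document never contains the word "apoptosis", yet it scores close to the first one because both share the "cell death" context, whereas a plain string match would return only the first document. A realistic pipeline would likely add term weighting (for example TF-IDF), stop-word removal, and a much larger number of latent dimensions; those choices are not specified in the text above.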