Department of Computer Science, Jamia Millia Islamia (A Central University), New Delhi, India.
J Biomed Inform. 2010 Dec;43(6):1020-35. doi: 10.1016/j.jbi.2010.09.008. Epub 2010 Sep 24.
A number of techniques, such as information extraction, document classification, document clustering and information visualization, have been developed to ease the extraction and understanding of information embedded within text documents. However, knowledge embedded in natural language texts is difficult to extract using simple pattern matching techniques, and most of these methods do not help users directly understand key concepts and their semantic relationships in document corpora, which are critical for capturing their conceptual structures. The problem arises because most of the information is embedded within unstructured or semi-structured texts that computers cannot easily interpret. In this paper, we present a novel Biomedical Knowledge Extraction and Visualization framework, BioKEVis, to identify key information components from biomedical text documents. The information components are centered on key concepts, which BioKEVis identifies by applying linguistic analysis and Latent Semantic Analysis (LSA). Information component extraction is based on natural language processing techniques and semantic-based analysis. The system is also integrated with a biomedical named entity recognizer, ABNER, to tag genes, proteins and other entity names in the text. We also present a method for collating information extracted from multiple sources to generate a semantic network. The network provides distinct user perspectives, allows navigation over documents with similar information components, and provides a comprehensive view of the collection. The system stores the extracted information components in a structured repository, which is integrated with a query-processing module to handle biomedical queries over text documents. We also propose a document ranking mechanism that presents retrieved documents in order of their relevance to the user query.
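As an illustration of the LSA step mentioned above, the following is a minimal sketch and not the BioKEVis implementation. It assumes whitespace tokenization, raw term counts, and a toy four-document corpus invented for the example; the function names (`term_document_matrix`, `lsa`, `key_terms`, `rank_documents`) are hypothetical. Ranking by cosine similarity in the latent space is likewise an assumption standing in for the ranking mechanism proposed in the paper, whose details the abstract does not give.

```python
import numpy as np


def term_document_matrix(docs):
    """Build a raw term-frequency matrix with shape (terms, documents)."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.lower().split():
            matrix[index[tok], j] += 1.0
    return vocab, matrix


def lsa(matrix, k):
    """Truncated SVD: keep the k largest singular values and their vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


def key_terms(vocab, u_k, s_k, top_n):
    """Score each term by the length of its vector in the latent space."""
    scores = np.linalg.norm(u_k * s_k, axis=1)
    order = np.argsort(-scores)[:top_n]
    return [vocab[i] for i in order]


def rank_documents(query, vocab, u_k, s_k, vt_k):
    """Fold the query into the latent space and rank documents by cosine."""
    index = {term: i for i, term in enumerate(vocab)}
    q = np.zeros(len(vocab))
    for tok in query.lower().split():
        if tok in index:
            q[index[tok]] += 1.0
    q_hat = (q @ u_k) / s_k          # query folding: q^T U_k S_k^{-1}
    doc_vecs = vt_k.T                # one row per document
    eps = 1e-12                      # guards against division by zero
    sims = doc_vecs @ q_hat / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat) + eps
    )
    return [int(i) for i in np.argsort(-sims)]


# Toy corpus, invented for illustration only.
DOCS = [
    "gene regulates protein expression",
    "protein binds gene promoter",
    "patient received drug treatment",
    "drug treatment reduced symptoms",
]

if __name__ == "__main__":
    vocab, matrix = term_document_matrix(DOCS)
    u_k, s_k, vt_k = lsa(matrix, k=2)
    print(key_terms(vocab, u_k, s_k, top_n=4))
    print(rank_documents("gene protein", vocab, u_k, s_k, vt_k))
```

In this sketch, terms that co-occur across several documents receive longer latent vectors and so surface as candidate key concepts, while a query is compared with documents in the same reduced space rather than by exact word overlap. A real pipeline would add the linguistic analysis and entity tagging described above before building the matrix.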