San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive, Mailcode 0505, La Jolla, CA 92093-0505, USA.
BMC Bioinformatics. 2010 Apr 29;11:220. doi: 10.1186/1471-2105-11-220.
Biological data have traditionally been stored and made publicly available through a variety of on-line databases, whereas biological knowledge has been found mainly in the printed literature. With journals now on-line and providing an increasing amount of open access content, often free of copyright restriction, this distinction between database and literature is blurring. To exploit this opportunity, we present the integration of open access literature with the RCSB Protein Data Bank (PDB).
BioLit provides an enhanced view of articles, with markup of semantic data and links to biological databases, based on the content of the article. For example, words that match terms in existing biological ontologies are highlighted, and database identifiers are linked to their database of origin. Among other functions, it identifies PDB IDs mentioned in the open access literature by parsing the full text of all research articles in PubMed Central (PMC) and exposes the results as simple XML Web Services. Here, we integrate BioLit results with the RCSB PDB website by using these services to find PDB IDs that are mentioned in research articles and subsequently retrieving the abstract, figures, and text excerpts for those articles. A new RCSB PDB literature view permits browsing through the figures and abstracts of the articles that mention a given structure. The BioLit Web Services that provide the underlying data are publicly accessible, and a Java client library that supports querying these services is provided.
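As a rough illustration of what detecting PDB IDs in full text can involve, the following is a minimal sketch and not BioLit's actual implementation, which is not described here. It assumes the classic four-character ID format (one digit from 1 to 9 followed by three alphanumeric characters) and, because that pattern also matches ordinary numbers such as years, it keeps a candidate only if the word "PDB" appears nearby. The function name `find_pdb_ids`, the `window` parameter, and the cue heuristic are all hypothetical choices made for this example.

```python
import re

# Classic PDB ID format: one digit (1-9) followed by three alphanumerics.
PDB_ID = re.compile(r"\b[1-9][A-Za-z0-9]{3}\b")

# Assumed context cue (not taken from BioLit): trust a candidate only
# when the word "PDB" occurs close to it in the text.
CUE = re.compile(r"\bPDB\b", re.IGNORECASE)


def find_pdb_ids(text, window=40):
    """Return sorted, upper-cased candidate PDB IDs found near a 'PDB' cue.

    A candidate is kept if its start offset lies within `window`
    characters of the start offset of any cue match.
    """
    cue_offsets = [m.start() for m in CUE.finditer(text)]
    found = set()
    for match in PDB_ID.finditer(text):
        if any(abs(match.start() - c) <= window for c in cue_offsets):
            found.add(match.group().upper())
    return sorted(found)


if __name__ == "__main__":
    sample = "We compared hemoglobin (PDB ID 4HHB) with myoglobin (PDB: 1mbn)."
    print(find_pdb_ids(sample))
```

The proximity check trades recall for precision: an ID mentioned without any nearby cue is missed, whereas a bare four-digit number in unrelated prose is ignored. A production system would likely also validate candidates against the list of IDs that actually exist in the PDB.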
The integration between literature and websites, as demonstrated here with the RCSB PDB, provides a broader view of how a given structure has been analyzed and used. This approach detects the mention of a PDB structure even if it is not formally cited in the paper. Other structures related through the same literature references can also be identified, possibly providing new scientific insight. To our knowledge, this is the first time that database and literature have been integrated in this way, and it speaks to the opportunities afforded by open and free access to both database and literature content.
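The idea of finding structures related through shared literature can be sketched as a simple co-mention count. This is an illustrative example under stated assumptions, not the method used by the RCSB PDB site: it assumes the article-to-ID mapping has already been obtained (for instance from a mention-detection step), and the function name `related_structures` and the placeholder article keys are hypothetical.

```python
from collections import defaultdict


def related_structures(mentions, pdb_id):
    """Count how many articles co-mention each other structure with `pdb_id`.

    `mentions` maps an article identifier to the set of PDB IDs that the
    article mentions. Returns a dict of other PDB ID -> shared article count.
    """
    counts = defaultdict(int)
    for ids in mentions.values():
        if pdb_id in ids:
            for other in ids:
                if other != pdb_id:
                    counts[other] += 1
    return dict(counts)


if __name__ == "__main__":
    # Placeholder data for illustration only; not real article records.
    mentions = {
        "article-1": {"4HHB", "1MBN"},
        "article-2": {"4HHB", "1MBN", "2DHB"},
        "article-3": {"1CRN"},
    }
    print(related_structures(mentions, "4HHB"))
```

Ranking the result by count gives a crude measure of how strongly two structures are associated in the literature; structures that never share an article with the query do not appear at all.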