Miller Holly, Norton Catherine N, Sarkar Indra Neil
MBLWHOI Library, Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA.
BMC Res Notes. 2009 Jun 9;2:101. doi: 10.1186/1756-0500-2-101.
GenBank is a public repository for all publicly available molecular sequence data, drawn from a range of sources. In addition to relevant metadata (e.g., sequence description, source organism and taxonomy), each GenBank data file records publication information. Identifying the literature associated with a given molecular sequence may be an essential first step in developing research hypotheses. Many of the publications associated with GenBank records are not indexed in, or linked to, complementary literature databases (e.g., PubMed). GenBank records associated with literature indexed in Medline, however, can be identified because they contain PubMed identifiers (PMIDs).
An analysis of 87,116,501 GenBank sequence files shows that 42% are associated with a publication or patent. Of these, 71% are associated with PMIDs and can therefore be linked to a citation record in the PubMed database. The remaining 29% of publication-associated GenBank entries either lack a PMID or cite a publication that is not currently indexed by PubMed. We also identify the journal titles that are linked, through citations in the GenBank files, to the largest number of sequences.
Our analysis suggests that GenBank contains molecular sequences from a range of disciplines beyond biomedicine, the initial scope of PubMed. The findings thus suggest opportunities to develop mechanisms for integrating biological knowledge beyond the biomedical field.