Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA.
BMC Bioinformatics. 2014 Feb 5;15:43. doi: 10.1186/1471-2105-15-43.
The scientific literature contains millions of microbial gene identifiers in full text and tables, but these annotations are rarely incorporated into public sequence databases. We propose using the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database and have developed an R package, pmcXML, to automatically mine and extract locus tags from full text, tables and supplements.
We mined locus tags from 1835 OA publications covering ten microbial genomes and extracted tags mentioned in 30,891 sentences in the main text and 20,489 rows in tables. We also identified locus tag pairs marking the start and end of a region, such as an operon or genomic island, and expanded these ranges to add another 13,043 tags. For comparison, we searched for Burkholderia pseudomallei K96243 locus tags in supplementary tables and in publications outside the OA subset. There were 168 publications containing 48,470 locus tags; 83% of mentions came from supplementary materials and 9% from publications outside the OA subset.
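The range expansion described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the pmcXML implementation. The tag pattern (prefixes BPSL and BPSS followed by four digits), the range connectors it recognizes, and the function names `expand_range` and `mine_sentence` are all assumptions made for this example. It also assumes that tags in a range are numbered consecutively, whereas a real pipeline would need to check expanded tags against the genome annotation.

```python
import re

# Assumed tag format for illustration: prefix BPSL or BPSS plus four digits.
TAG = r"\b(BPS[LS])(\d{4})\b"
# Assumed range connectors: hyphen, en dash, "to", or "through".
RANGE = re.compile(TAG + r"\s*(?:[-\u2013]|to|through)\s*" + TAG)


def expand_range(start, end):
    """Return every tag from start to end inclusive, or [] if the pair is not a valid range."""
    m1, m2 = re.fullmatch(TAG, start), re.fullmatch(TAG, end)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        return []  # unparseable tag, or the two prefixes differ
    low, high = int(m1.group(2)), int(m2.group(2))
    if low > high:
        return []
    prefix, width = m1.group(1), len(m1.group(2))
    # Assumes consecutive numbering with no gaps between start and end.
    return [f"{prefix}{n:0{width}d}" for n in range(low, high + 1)]


def mine_sentence(sentence):
    """Return (tags mentioned explicitly, extra tags implied by ranges)."""
    mentioned = [prefix + number for prefix, number in re.findall(TAG, sentence)]
    expanded = []
    for m in RANGE.finditer(sentence):
        start, end = m.group(1) + m.group(2), m.group(3) + m.group(4)
        for tag in expand_range(start, end):
            if tag not in mentioned and tag not in expanded:
                expanded.append(tag)
    return mentioned, expanded


mentioned, expanded = mine_sentence(
    "The operon BPSL0001 to BPSL0004 was upregulated, as was BPSS1501."
)
```

In this example sentence, three tags are mentioned explicitly and the range contributes two further tags that never appear in the text, which is how expansion can add tags beyond those counted in sentences and table rows.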
B. pseudomallei locus tags within the full text and tables of OA publications represent only a small fraction of the total mentions in the literature. For microbial genomes with very few functionally characterized proteins, the tags found in supplementary tables and within ranges such as genomic islands make up the majority of locus tags mentioned. Significantly, the functions in the R package provide access to additional resources in the OA subset that are not currently indexed or returned by searching PMC.