School of Computer Science, University of Manchester, Manchester, United Kingdom.
PLoS One. 2011 Mar 29;6(3):e14780. doi: 10.1371/journal.pone.0014780.
Research on specialized biological systems is often hampered by inconsistent terminology, especially across species. In bacterial Type IV secretion systems (T4SS), genes within a single set of orthologs may have over a dozen different names. Classifying research publications by biological process, cellular component, molecular function, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entity types related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entity types are important to pathogenesis and virulence research but have received less attention than others, such as genes and proteins. The recognizers draw on an annotated corpus, large domain terminological resources, and machine learning techniques. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular functions. Contrastive experiments highlighted the effectiveness of alternate recognition strategies, and term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.
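To illustrate the terminology-driven side of the recognition task described in the abstract, the following is a minimal sketch of dictionary-based tagging for the four entity classes. It is not the authors' implementation: the lexicon entries, the label names, and the functions `build_patterns` and `tag_entities` are hypothetical and chosen only for illustration. The machine learning component mentioned in the abstract is not covered here.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical toy lexicon. The terms are illustrative examples only and
# do not come from the terminological resources used in the paper.
LEXICON: Dict[str, List[str]] = {
    "bacteria": ["Agrobacterium tumefaciens", "Helicobacter pylori"],
    "biological_process": ["conjugation", "protein secretion"],
    "molecular_function": ["ATPase activity", "DNA binding"],
    "cellular_component": ["inner membrane", "pilus"],
}

def build_patterns(lexicon: Dict[str, List[str]]) -> Dict[str, Pattern[str]]:
    """Compile one case-insensitive regex per entity class."""
    patterns = {}
    for label, terms in lexicon.items():
        # Put longer terms first so they are preferred over shorter overlaps.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in ordered)
        patterns[label] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return patterns

def tag_entities(text: str, patterns: Dict[str, Pattern[str]]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, label, surface form) tuples sorted by position."""
    hits = []
    for label, pattern in patterns.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label, match.group(0)))
    return sorted(hits)

if __name__ == "__main__":
    sentence = "Helicobacter pylori uses a pilus for protein secretion."
    for start, end, label, surface in tag_entities(sentence, build_patterns(LEXICON)):
        print(f"{start:>3}-{end:<3} {label:<20} {surface}")
```

Exact string matching of this kind cannot handle the synonym and naming variation that the abstract identifies as the core difficulty, which is one reason a lexicon would plausibly be combined with a learned recognizer trained on an annotated corpus.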