Braun Ian R, Lawrence-Dill Carolyn J
Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, IA, United States.
Interdepartmental Bioinformatics and Computational Biology, Iowa State University, Ames, IA, United States.
Front Plant Sci. 2020 Jan 10;10:1629. doi: 10.3389/fpls.2019.01629. eCollection 2019.
Natural language descriptions of plant phenotypes are a rich source of information for genetics and genomics research. We computationally translated descriptions of plant phenotypes into structured representations that can be analyzed to identify biologically meaningful associations. These representations include the entity-quality (EQ) formalism, which uses terms from biological ontologies to represent phenotypes in a standardized, semantically rich format, as well as numerical vector representations generated using natural language processing (NLP) methods (such as the bag-of-words approach and document embedding). To determine the performance of each method, we compared the resulting phenotype similarity measures to those derived from manually curated data. Computationally derived EQ and vector representations recapitulated biological truth about as successfully as representations created through manual curation of EQ statements. Moreover, NLP methods for generating vector representations of phenotypes scale to large quantities of text because they require no human input. These results indicate that it is now possible to automatically produce and populate large-scale information resources that enable researchers to query phenotypic descriptions directly.
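As a rough illustration of the bag-of-words idea mentioned above, the following minimal sketch turns free-text phenotype descriptions into word-count vectors and compares them with cosine similarity. It is not the authors' pipeline: the gene labels and descriptions are invented for the example, the tokenizer is deliberately naive (no stop-word removal, stemming, or weighting), and cosine similarity is assumed here only as one common choice of similarity measure.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count each token."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical phenotype descriptions, made up for illustration only.
descriptions = {
    "gene_a": "dwarf plant with short internodes and narrow leaves",
    "gene_b": "short plant with reduced internodes and small leaves",
    "gene_c": "pale yellow kernels with reduced pigment",
}

vectors = {gene: bag_of_words(text) for gene, text in descriptions.items()}

sim_ab = cosine_similarity(vectors["gene_a"], vectors["gene_b"])
sim_ac = cosine_similarity(vectors["gene_a"], vectors["gene_c"])

print(f"gene_a vs gene_b: {sim_ab:.3f}")
print(f"gene_a vs gene_c: {sim_ac:.3f}")
```

In this toy data the two stature-related descriptions score higher against each other than either does against the kernel-color description, which is the kind of signal a phenotype similarity measure is meant to capture. Because no manual annotation is involved, the same computation can be applied to arbitrarily many descriptions.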