Nédellec Claire, Sauvion Clara, Bossy Robert, Borovikova Mariya, Deléger Louise
Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France.
TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France.
PLoS One. 2024 Jun 13;19(6):e0305475. doi: 10.1371/journal.pone.0305475. eCollection 2024.
Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. A growing number of plant molecular information networks provide interlinked, interoperable data to support the discovery of gene-phenotype interactions. A large body of scientific literature and observational data, obtained in-field and under controlled conditions, documents wheat breeding experiments. Cross-referencing this complementary information is essential. Text from databases and scientific publications was identified early on as a relevant source of information. However, the wide variety of terms used to refer to traits and phenotype values makes it difficult to find and cross-reference the textual information; for example, simple dictionary lookup methods miss relevant terms. Corpora with manually annotated examples are thus needed to evaluate and train textual information extraction methods. While several corpora contain annotations of human and animal phenotypes, no corpus is available for plant traits. This hinders the evaluation of text mining-based crop knowledge graphs (e.g. AgroLD, KnetMiner, WheatIS-FAIDARE) and limits the ability to train machine learning methods and improve the quality of information. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 528 PubMed references that are fully annotated by trait, phenotype, and species. We address the interoperability challenge of cross-referencing sparse assay data and publications by using the Wheat Trait and Phenotype Ontology to normalize trait mentions and the species taxonomy of the National Center for Biotechnology Information to normalize species. The paper describes the construction of the corpus. A study of the performance of state-of-the-art language models trained on the corpus, for both named entity recognition and entity linking tasks, shows that it is suitable for training and evaluation. This corpus is currently the most comprehensive manually annotated corpus for natural language processing studies on crop phenotype information from the literature.
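The remark that simple dictionary lookup misses relevant terms can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-entry lexicon, the `TRAIT:000x` identifiers, and the example sentences are hypothetical and are not taken from the corpus or from the Wheat Trait and Phenotype Ontology.

```python
import re

# Hypothetical mini-lexicon: surface form -> made-up concept identifier.
# These are NOT real ontology entries; they only illustrate the idea.
LEXICON = {
    "grain yield": "TRAIT:0001",
    "plant height": "TRAIT:0002",
    "resistance to leaf rust": "TRAIT:0003",
}


def dictionary_lookup(text, lexicon=LEXICON):
    """Return sorted (term, identifier) pairs found by exact,
    case-insensitive whole-phrase matching."""
    lowered = text.lower()
    hits = []
    for term, concept_id in lexicon.items():
        # \b anchors keep the match on word boundaries.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            hits.append((term, concept_id))
    return sorted(hits)


# The first sentence uses the lexicon's exact wording; the second
# expresses the same three traits with different surface forms.
exact = "Grain yield and plant height were measured in the field."
variant = ("The lines differed in yield of grain, and taller plants "
           "were more leaf rust resistant.")

print(dictionary_lookup(exact))    # finds two of the lexicon terms
print(dictionary_lookup(variant))  # finds nothing
```

The second sentence mentions all three traits, yet exact matching returns no hits. Reworded or reordered mentions of this kind are the ones that manually annotated examples let a trained recognition and linking model recover.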