Cui Hong, Macklin James A, Sachs Joel, Reznicek Anton, Starr Julian, Ford Bruce, Penev Lyubomir, Chen Hsin-Liang
University of Arizona, Tucson, United States of America.
Agriculture and Agri-Food Canada, Ottawa, Canada.
Biodivers Data J. 2018 Nov 7;(6):e29616. doi: 10.3897/BDJ.6.e29616. eCollection 2018.
Phenotypes are used for a multitude of purposes such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most phenotypic data are published in free-text narratives that are not computable. This means that the complex relationship between the genome, the environment and phenotypes is largely inaccessible to analysis, and important questions related to the evolution of organisms, their diseases or their response to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable solution to produce computable phenotypic data for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future; and, most importantly, 3) it does not solve the problem of inter-curator variation (curators interpret and convert the same phenotype differently from each other). Our empirical studies have shown that inter-curator variation is as high as 40% even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardised vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof-of-concept project on this idea was funded by NSF ABI in July 2017.
We seek readers' input on or critique of the proposed approaches to help achieve community-based computable phenotype data production in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production.