Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems.

Affiliations

University of South Dakota, Vermillion, SD, USA.

University of North Carolina at Greensboro, Greensboro, NC, USA.

Publication information

Database (Oxford). 2018 Jan 1;2018:bay110. doi: 10.1093/database/bay110.

DOI: 10.1093/database/bay110
PMID: 30576485
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6301375/
Abstract

Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis on phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.
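
The idea of a metric that "can account for any level of relationship between terms" can be illustrated with a small sketch. This is not the paper's implementation, and the abstract does not name its four metrics. The sketch assumes a toy is-a hierarchy with hypothetical term names, treats an entity-quality annotation as a pair of terms, and scores two annotations with Jaccard similarity over ancestor sets, which is one common way to give partial credit when two terms are related but not identical.

```python
# Toy is-a hierarchy: child -> list of parents.
# All term names are hypothetical and chosen only for illustration.
TOY_PARENTS = {
    "dorsal fin": ["fin"],
    "pectoral fin": ["fin"],
    "fin": ["anatomical structure"],
    "anatomical structure": [],
    "elongated": ["shape"],
    "round": ["shape"],
    "shape": ["quality"],
    "quality": [],
}


def ancestors(term, parents=TOY_PARENTS):
    """Return the term together with all of its is-a ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def jaccard(term_a, term_b):
    """Jaccard similarity of the two terms' ancestor sets (0.0 to 1.0)."""
    a, b = ancestors(term_a), ancestors(term_b)
    return len(a & b) / len(a | b)


def eq_similarity(eq_a, eq_b):
    """Mean of entity and quality similarity for two (entity, quality) pairs."""
    (entity_a, quality_a), (entity_b, quality_b) = eq_a, eq_b
    return (jaccard(entity_a, entity_b) + jaccard(quality_a, quality_b)) / 2


gold = ("dorsal fin", "elongated")      # hypothetical gold standard annotation
machine = ("pectoral fin", "elongated")  # hypothetical machine annotation
score = eq_similarity(gold, machine)
```

In this toy case the two entities share two of four ancestor-set members and the qualities match exactly, so the machine annotation receives partial rather than zero credit. Averaging the entity and quality scores is a simplifying assumption; a real evaluation would need to handle the more complex expressions the abstract mentions.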
