

Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers.

Affiliation

Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Department of Bioinformatics, Schloss Birlinghoven, 53754 Sankt Augustin, Germany.

Publication information

BMC Bioinformatics. 2011;12 Suppl 4(Suppl 4):S4. doi: 10.1186/1471-2105-12-S4-S4. Epub 2011 Jul 5.

DOI:10.1186/1471-2105-12-S4-S4
PMID:21992066
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3194196/
Abstract

BACKGROUND

Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations in natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and their associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures that map textual mentions of variations to database identifiers need to be developed.

RESULTS

This article describes a workflow for the normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for the unambiguous association of variation mentions with dbSNP identifiers on a text corpus of 296 MEDLINE abstracts containing 527 SNP mentions. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html.
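To put the reported precision and recall on a single scale, the short Python sketch below computes their harmonic mean (the F1 score). The F1 value is derived here for illustration and is not a figure stated in the abstract; the function name `f1_score` is a hypothetical helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, converted from percentages to fractions.
precision = 0.981
recall = 0.675

# Derived value (not reported in the abstract): roughly 0.80.
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")
```

The gap between the two inputs shows the trade-off: nearly every identifier the procedure assigns is correct, but about a third of the mentions receive no unambiguous identifier.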

CONCLUSIONS

Comparable approaches usually focus on variations mentioned at the protein sequence level and neglect the problems posed by other SNP mentions. The results presented here indicate that normalizing SNPs described at the DNA level is more difficult than normalizing SNPs described at the protein level. The challenges associated with normalization are exemplified by ambiguities and errors that occur in this corpus.
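As a rough illustration of why the description level matters, the Python sketch below sorts variation mentions into protein-level, DNA-level, and ambiguous forms using regular expressions. It is not the authors' workflow: the patterns, the `classify_mentions` helper, and the example sentence are assumptions chosen for illustration. One plausible source of ambiguity it exposes is that A, C, G, and T are both nucleotide symbols and one-letter amino acid codes, so a mention such as `C677T` cannot be assigned a level from its surface form alone.

```python
import re

# Three-letter amino acid codes, used to recognise protein-level mentions.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|"
       "Thr|Trp|Tyr|Val")

# Hypothetical, deliberately simple patterns; real mentions are far more varied.
PATTERNS = [
    ("dbsnp_id", re.compile(r"\brs\d+\b")),                    # e.g. rs1801133
    ("protein", re.compile(rf"\b(?:{AA3})\d+(?:{AA3})\b")),    # e.g. Ala222Val
    ("dna", re.compile(r"(?<![\w-])-?\d+\s?[ACGT]\s?(?:>|/)\s?[ACGT]\b")),  # e.g. 677C>T
    ("ambiguous", re.compile(r"\b[ACGT]-?\d+[ACGT]\b")),       # e.g. C677T
]


def classify_mentions(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs, grouped in the order of PATTERNS."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((label, match.group(0)))
    return found


# Illustrative sentence (not taken from the corpus).
sentence = "The MTHFR 677C>T variant (Ala222Val, rs1801133) is also written C677T."
for label, mention in classify_mentions(sentence):
    print(f"{label:10s} {mention}")
```

Only the `dbsnp_id` match is already normalized. The other three forms would still need gene context and a reference sequence before they could be linked to a dbSNP entry.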


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/8c60a2f05d37/1471-2105-12-S4-S4-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/0dac106b5d25/1471-2105-12-S4-S4-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/4f93aa90950d/1471-2105-12-S4-S4-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/c0bcb5bfa21b/1471-2105-12-S4-S4-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/09cedd0741df/1471-2105-12-S4-S4-5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/d077733ea38b/1471-2105-12-S4-S4-6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/3a0fb5e71040/1471-2105-12-S4-S4-7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/0cb8b8183298/1471-2105-12-S4-S4-8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/70af/3194196/7f5b679a5897/1471-2105-12-S4-S4-9.jpg

