Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Department of Bioinformatics, Schloss Birlinghoven, 53754 Sankt Augustin, Germany.
BMC Bioinformatics. 2011;12 Suppl 4(Suppl 4):S4. doi: 10.1186/1471-2105-12-S4-S4. Epub 2011 Jul 5.
Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations in natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and their associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures that map textual mentions of variations to database identifiers need to be developed.
This article describes a workflow for the normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for the unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html.
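To make the two reported measures concrete, the following is a minimal sketch of how precision and recall could be computed for a normalization run. It is not the authors' evaluation code: the function name `precision_recall`, the dictionary layout, and the mention keys and rs identifiers in the toy data are all invented for illustration. It assumes that a mention the system declines to normalize is recorded as `None`, so that it lowers recall without affecting precision.

```python
from typing import Dict, Optional, Tuple


def precision_recall(
    predicted: Dict[str, Optional[str]],
    gold: Dict[str, str],
) -> Tuple[float, float]:
    """Score predicted identifiers against gold-standard identifiers.

    predicted maps a mention key to the identifier chosen by the system,
    or None when the system made no unambiguous assignment.
    gold maps every annotated mention key to its correct identifier.
    """
    # Only mentions for which the system actually committed to an identifier.
    assigned = {m: rs for m, rs in predicted.items() if rs is not None}
    # An assignment is correct when it equals the gold identifier.
    correct = sum(1 for m, rs in assigned.items() if gold.get(m) == rs)

    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data: hypothetical mention keys and placeholder identifiers.
gold = {
    "abstract1:m1": "rs0000001",
    "abstract1:m2": "rs0000002",
    "abstract2:m1": "rs0000003",
    "abstract2:m2": "rs0000004",
}
predicted = {
    "abstract1:m1": "rs0000001",  # correct
    "abstract1:m2": "rs0000002",  # correct
    "abstract2:m1": "rs0000009",  # wrong identifier
    "abstract2:m2": None,         # left unnormalized
}

precision, recall = precision_recall(predicted, gold)
```

In this toy run, two of the three committed assignments are correct and two of the four annotated mentions are recovered. A system that abstains whenever a mention is ambiguous trades recall for precision, which matches the pattern of the figures reported above.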
Comparable approaches usually focus on variations described at the protein sequence level and neglect the problems posed by other SNP mentions. The results presented here indicate that normalizing SNPs described at the DNA level is more difficult than normalizing SNPs described at the protein level. The challenges associated with normalization are illustrated by the ambiguities and errors that occur in this corpus.
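One way such ambiguity can arise is in deciding whether a mention refers to the DNA or the protein level. The sketch below is an illustration under stated assumptions, not the workflow described in the article: the three regular expressions, the function `classify_mention`, and the example strings are hypothetical and cover only simple substitution notations. It relies on the fact that the letters A, C, G and T are both nucleotide codes and one-letter amino acid codes (Ala, Cys, Gly, Thr), so a short form such as "C677T" cannot be assigned to one level from the string alone.

```python
import re

# Three-letter amino acid notation, e.g. "Val600Glu" or "p.Val600Glu".
PROTEIN_THREE_LETTER = re.compile(r"(?:p\.)?[A-Z][a-z]{2}\d+[A-Z][a-z]{2}")
# Explicit nucleotide substitution, e.g. "c.677C>T" or "677C>T".
DNA_SUBSTITUTION = re.compile(r"(?:[cg]\.)?-?\d+[ACGT]>[ACGT]")
# Short letter-position-letter form, e.g. "V600E" or "C677T".
LETTER_POSITION_LETTER = re.compile(r"([A-Z])\d+([A-Z])")

NUCLEOTIDES = set("ACGT")


def classify_mention(mention: str) -> str:
    """Return 'protein', 'dna', 'ambiguous', or 'unknown' for a mention string."""
    if PROTEIN_THREE_LETTER.fullmatch(mention):
        return "protein"
    if DNA_SUBSTITUTION.fullmatch(mention):
        return "dna"
    short = LETTER_POSITION_LETTER.fullmatch(mention)
    if short:
        wild_type, variant = short.groups()
        # Both letters are valid nucleotides AND valid amino acid codes,
        # so the string alone does not reveal the sequence level.
        if wild_type in NUCLEOTIDES and variant in NUCLEOTIDES:
            return "ambiguous"
        return "protein"
    return "unknown"


examples = ["Val600Glu", "c.677C>T", "V600E", "C677T"]
labels = {m: classify_mention(m) for m in examples}
```

Mentions labeled `ambiguous` here would need additional evidence, such as the surrounding text or a comparison against the reference sequences of the gene in question, before they could be linked to a single database identifier.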