Bokharaeian Behrouz, Diaz Alberto, Taghizadeh Nasrin, Chitsaz Hamidreza, Chavoshinejad Ramyar
Facultad informatica, Complutense University of Madrid, Calle Profesor José García Santesmases, 9, 28040, Madrid, Spain.
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran.
J Biomed Semantics. 2017 Apr 7;8(1):14. doi: 10.1186/s13326-017-0116-2.
Single nucleotide polymorphisms (SNPs) are among the most important types of genetic variation influencing common diseases and phenotypes. Several corpora and methods have recently been developed for extracting mutations and diseases from text. However, no available corpus for extracting associations from text is annotated with linguistic-based negation, modality markers, neutral candidates, and the confidence level of associations.
This work presents the steps taken to produce the SNPPhenA corpus: automatic Named Entity Recognition (NER) followed by manual annotation of SNP and phenotype names, annotation of SNP-phenotype associations and their confidence level, and annotation of modality markers. The corpus was also annotated with negation scopes and cues, and with neutral candidates, which play a crucial role in handling negation and modality in extraction tasks.
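The annotation layers listed above can be pictured as one record per SNP-phenotype candidate. The following is only a minimal sketch and not the corpus's actual file format: the class name, the field names, the label sets ("positive", "negative", "neutral"; "weak", "moderate", "strong") and the example sentence with its placeholder SNP identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical label sets; the actual SNPPhenA guidelines may use other names.
ASSOCIATION_LABELS = {"positive", "negative", "neutral"}
CONFIDENCE_LABELS = {"weak", "moderate", "strong"}


@dataclass
class AssociationCandidate:
    """One SNP-phenotype candidate with its annotation layers (illustrative)."""
    sentence: str
    snp: str
    phenotype: str
    association: str                      # one of ASSOCIATION_LABELS
    confidence: Optional[str] = None      # one of CONFIDENCE_LABELS, if annotated
    modality_markers: List[str] = field(default_factory=list)
    negation_cue: Optional[str] = None
    negation_scope: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject labels outside the assumed label sets.
        if self.association not in ASSOCIATION_LABELS:
            raise ValueError(f"unknown association label: {self.association}")
        if self.confidence is not None and self.confidence not in CONFIDENCE_LABELS:
            raise ValueError(f"unknown confidence label: {self.confidence}")
        # Every annotated span must literally occur in the sentence.
        spans = [self.snp, self.phenotype, self.negation_cue, self.negation_scope]
        spans.extend(self.modality_markers)
        for span in spans:
            if span is not None and span not in self.sentence:
                raise ValueError(f"span not found in sentence: {span!r}")


# Made-up sentence with a placeholder SNP identifier, for illustration only.
candidate = AssociationCandidate(
    sentence="rs0000 may not be associated with asthma.",
    snp="rs0000",
    phenotype="asthma",
    association="negative",
    confidence="weak",
    modality_markers=["may"],
    negation_cue="not",
    negation_scope="be associated with asthma",
)
```

Keeping the negation cue, negation scope and modality markers as separate fields mirrors the separate annotation layers described above, so an extraction method can use or ignore each layer independently.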
Inter-annotator agreement was measured with Cohen's Kappa coefficient, and the resulting scores indicate that the corpus is reliable: Kappa was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Basic statistics of the annotated features are also presented, together with the results of our first experiments on extracting ranked SNP-phenotype associations. The accompanying guideline documents make the corpus easier to use. The corpus, guidelines and inter-annotator agreement analysis are available on the website of the corpus: http://nil.fdi.ucm.es/?q=node/639 .
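For readers unfamiliar with the agreement measure, Cohen's Kappa compares the observed agreement between two annotators, p_o, with the agreement expected by chance, p_e, as (p_o - p_e) / (1 - p_e). The sketch below computes it from two label sequences. The function name, the label names and the ten toy annotations are assumptions chosen for illustration; they are not data from the corpus.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' marginal label
    # frequencies, summed over all labels either annotator used.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations (illustrative only, not taken from SNPPhenA).
annotator_1 = ["positive", "positive", "negative", "neutral", "positive",
               "negative", "neutral", "positive", "negative", "positive"]
annotator_2 = ["positive", "positive", "negative", "neutral", "negative",
               "negative", "neutral", "positive", "positive", "positive"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the annotators agree on 8 of 10 items (p_o = 0.80) and chance agreement is p_e = 0.38, so Kappa is 0.42 / 0.62, or about 0.68.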
Specifying the confidence degree of SNP-phenotype associations reported in articles helps identify the strength of those associations, which could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors. Moreover, our first experiments with the corpus show that linguistic-based confidence, alongside other non-linguistic features, can be used to estimate the strength of the observed SNP-phenotype associations.
Not Applicable.