Masseroli Marco, Kilicoglu Halil, Lang François-Michel, Rindflesch Thomas C
Bioengineering Department, Politecnico di Milano, Milan, Italy.
BMC Bioinformatics. 2006 Jun 8;7:291. doi: 10.1186/1471-2105-7-291.
Genomic functional information is valuable for biomedical research. However, such information frequently needs to be extracted from the scientific literature and structured in order to be exploited by automatic systems. Natural language processing is increasingly used for this purpose although it inherently involves errors. A postprocessing strategy that selects relations most likely to be correct is proposed and evaluated on the output of SemGen, a system that extracts semantic predications on the etiology of genetic diseases. Based on the number of intervening phrases between an argument and its predicate, we defined a heuristic strategy to filter the extracted semantic relations according to their likelihood of being correct. We also applied this strategy to relations identified with co-occurrence processing. Finally, we exploited postprocessed SemGen predications to investigate the genetic basis of Parkinson's disease.
The filtering procedure for increased precision is based on the intuition that arguments occurring close to their predicate are easier to identify than those at a distance. For example, if gene-gene relations are filtered for arguments at a distance of 1 phrase from the predicate, precision increases from 41.95% (baseline) to 70.75%. Since this proximity filtering is based on syntactic structure, applying it to the results of co-occurrence processing is useful, but not as effective as applying it to the output of natural language processing. To exploit SemGen predications on the etiology of disease, a gene list was derived from the postprocessed predications and automatically annotated with GFINDer, a Web application that dynamically retrieves functional and phenotypic information from structured biomolecular resources. Two of the genes in this list are likely relevant to Parkinson's disease but are not associated with this disease in several important databases on genetic disorders.
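The proximity filter described above can be illustrated with a minimal sketch. It is not SemGen's implementation: the `Predication` record, its field names, the convention that a distance of 1 means the argument sits in the phrase adjacent to the predicate, and the toy data are all assumptions made for illustration. The sketch keeps only relations whose two arguments both lie within a chosen distance of the predicate, then compares precision before and after filtering against hypothetical gold-standard judgments.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Predication:
    """Hypothetical record for one extracted relation (not SemGen's format)."""
    subject: str
    predicate: str
    obj: str
    subject_distance: int  # distance in phrases from subject to predicate (assumed convention)
    object_distance: int   # distance in phrases from object to predicate (assumed convention)
    is_correct: bool       # gold-standard judgment, invented here for illustration


def proximity_filter(predications: List[Predication], max_distance: int) -> List[Predication]:
    """Keep relations whose subject and object are both within max_distance phrases."""
    return [
        p for p in predications
        if p.subject_distance <= max_distance and p.object_distance <= max_distance
    ]


def precision(predications: List[Predication]) -> float:
    """Fraction of relations judged correct; 0.0 for an empty list."""
    if not predications:
        return 0.0
    return sum(p.is_correct for p in predications) / len(predications)


# Toy data with placeholder gene names; these are not results from the paper.
TOY_PREDICATIONS = [
    Predication("GENE_A", "INTERACTS_WITH", "GENE_B", 1, 1, True),
    Predication("GENE_C", "STIMULATES", "GENE_D", 1, 1, True),
    Predication("GENE_E", "INHIBITS", "GENE_F", 1, 3, False),
    Predication("GENE_G", "INTERACTS_WITH", "GENE_H", 4, 2, False),
    Predication("GENE_I", "STIMULATES", "GENE_J", 2, 1, True),
    Predication("GENE_K", "INHIBITS", "GENE_L", 1, 1, False),
]

if __name__ == "__main__":
    filtered = proximity_filter(TOY_PREDICATIONS, max_distance=1)
    print(f"baseline precision: {precision(TOY_PREDICATIONS):.2f}")
    print(f"filtered precision: {precision(filtered):.2f}")
    print(f"relations kept: {len(filtered)} of {len(TOY_PREDICATIONS)}")
```

The toy data also shows the trade-off inherent in this kind of filter: a correct relation with a more distant argument (GENE_I and GENE_J) is discarded, so the gain in precision is paid for with fewer retained relations.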
Information based on the proximity postprocessing method we suggest is of sufficient quality to be profitably used for subsequent applications aimed at uncovering new biomedical knowledge. Although proximity filtering is only marginally effective for enhancing the precision of relations extracted with co-occurrence processing, it is likely to benefit methods based, even partially, on syntactic structure, regardless of the relation.