Department of Computer Science, Sogang University, Seoul, Korea.
BMC Bioinformatics. 2010 Feb 25;11:107. doi: 10.1186/1471-2105-11-107.
The construction of interaction networks between proteins is central to understanding the underlying biological processes. However, because many useful relations are not included in databases and remain hidden in raw text, automatic extraction of interactions from text is an important problem in the bioinformatics field.
Here, we propose two kinds of kernel methods for genic interaction extraction that take the structural aspects of sentences into account. First, we improve our prior dependency kernel by modifying the kernel function so that it can cover various substructures in terms of (1) e-walks, (2) partial matches, (3) non-contiguous paths, and (4) different significance of substructures. Second, we propose the walk-weighted subsequence kernel to parameterize non-contiguous syntactic structures as well as semantic roles and lexical features, which makes learning structural aspects from a small amount of training data effective. Furthermore, we distinguish the significance of parameters such as syntactic locality, semantic roles, and lexical features by varying their weights.
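To make the idea of a subsequence kernel over non-contiguous path elements concrete, the following is a minimal sketch of a standard gap-weighted subsequence kernel computed by brute force over two short dependency paths. It is not the walk-weighted subsequence kernel itself, whose exact definition is not given in this abstract. The function name, the `(kind, value)` element encoding, the decay factor `lam`, the per-kind weights in `kind_weight`, and the example paths are all illustrative assumptions.

```python
from itertools import combinations

def subsequence_kernel(s, t, n, lam=0.5, kind_weight=None):
    """Gap-weighted subsequence kernel of order n (brute-force sketch).

    s, t        : paths given as lists of (kind, value) tuples, e.g.
                  ("word", "stimulates") or ("dep", "subj")  (assumed encoding)
    lam         : decay in (0, 1]; wider (gappier) matches count less
    kind_weight : hypothetical per-kind weights, e.g. {"dep": 2.0}
    """
    kind_weight = kind_weight or {}
    total = 0.0
    # Enumerate every increasing index tuple of length n in both paths.
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            if all(s[a] == t[b] for a, b in zip(i, j)):
                # Span = number of positions covered, including gaps.
                span = (i[-1] - i[0] + 1) + (j[-1] - j[0] + 1)
                w = 1.0
                for a in i:
                    w *= kind_weight.get(s[a][0], 1.0)
                total += w * lam ** span
    return total

# Two invented paths that share the "subj - stimulates - obj" core.
p1 = [("word", "GerE"), ("dep", "subj"), ("word", "stimulates"),
      ("dep", "obj"), ("word", "cotD")]
p2 = [("word", "SigK"), ("dep", "subj"), ("word", "stimulates"),
      ("dep", "obj"), ("word", "cotA")]

print(subsequence_kernel(p1, p2, n=2))
print(subsequence_kernel(p1, p2, n=2, kind_weight={"dep": 2.0}))
```

The brute-force enumeration is exponential in `n` and is only reasonable for short paths; a dynamic-programming formulation would be the usual choice in practice. Raising the weight for one kind of element is one simple way to express that some features matter more than others.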
We addressed the genic interaction problem with various dependency kernels and proposed several structural kernel scenarios based on the directed shortest dependency path connecting two entities. We obtained promising results on genic interaction data sets with the walk-weighted subsequence kernel. The results are compared using automatically parsed third-party protein-protein interaction (PPI) data as well as perfectly syntactically labeled PPI data.
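As an illustration of the directed shortest dependency path between two entities, the sketch below runs a breadth-first search over a dependency graph while recording the direction of each arc it crosses. The edge format `(head, dependent, label)`, the arrow notation for direction, and the example sentence are assumptions made for this example, not the representation used in the paper.

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """Return the shortest path as [token, arc, token, ...] or None.

    edges: iterable of (head, dependent, label) triples (assumed format).
    An arc followed head -> dependent is written "label>", and one followed
    dependent -> head is written "<label", so direction is preserved.
    """
    adj = {}
    for head, dep, label in edges:
        adj.setdefault(head, []).append((dep, label + ">"))
        adj.setdefault(dep, []).append((head, "<" + label))

    prev = {source: None}          # node -> (previous node, arc taken)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt, arc in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = (node, arc)
                queue.append(nxt)
    if target not in prev:
        return None

    # Walk back from the target, then reverse to get source -> target order.
    path = [target]
    node = target
    while prev[node] is not None:
        node_prev, arc = prev[node]
        path.extend([arc, node_prev])
        node = node_prev
    path.reverse()
    return path

# Invented parse of "GerE stimulates transcription of cotD".
edges = [("stimulates", "GerE", "subj"),
         ("stimulates", "transcription", "obj"),
         ("transcription", "cotD", "of")]
print(shortest_dependency_path(edges, "GerE", "cotD"))
```

Tokens are identified by their surface string here for brevity; a real sentence with repeated words would need token indices as node identifiers.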