Department of Computer Science, University of Missouri, Columbia, MO 65211, USA.
Bioinformatics. 2012 Mar 15;28(6):867-75. doi: 10.1093/bioinformatics/bts042. Epub 2012 Jan 27.
In an infectious disease, the pathogen's strategy to enter the host organism and breach its immune defenses often involves interactions between the host and pathogen proteins. Currently, the experimental data on host-pathogen interactions (HPIs) are scattered across multiple databases, which are often specialized to target a specific disease or host organism. An accurate and efficient method for the automated extraction of HPIs from biomedical literature is crucial for creating a unified repository of HPI data.
Here, we introduce and compare two new approaches that automatically detect whether the title or abstract of a PubMed publication contains HPI data and extract information about the organisms and proteins involved in the interaction. The first approach is a feature-based supervised learning method using support vector machines (SVMs). The SVM models are trained on features derived from individual sentences. These features include the names of the host and pathogen organisms and their corresponding proteins or genes, keywords describing HPI-specific information, more general protein-protein interaction information, experimental methods and other statistical information. The second approach is a language-based method that employs a link grammar parser combined with semantic patterns derived from the training examples. Both approaches were trained and tested on manually curated HPI data. When compared to a naïve approach based on an existing protein-protein interaction literature mining method, our approaches demonstrated higher accuracy and recall in the classification task. The feature-based approach was the most accurate, achieving 66-73% accuracy depending on the test protocol.
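As a rough illustration of the sentence-level features described above, the following Python sketch turns one sentence into a small numeric vector of the kind an SVM could be trained on. It is not the authors' implementation: the lexicons, the tokenizer pattern, the function name `sentence_features` and the choice of five features are all assumptions made for this example, and the SVM training step itself is omitted.

```python
import re

# Hypothetical mini-lexicons, for illustration only; the abstract does not
# list the actual dictionaries or keywords used by the authors.
HOST_TERMS = {"human", "host", "mouse"}
PATHOGEN_TERMS = {"hiv-1", "salmonella", "virus", "pathogen"}
INTERACTION_KEYWORDS = {"binds", "interacts", "targets", "inhibits"}
METHOD_KEYWORDS = {"co-immunoprecipitation", "two-hybrid", "pull-down"}

# Lowercase alphanumeric tokens, keeping internal hyphens ("hiv-1").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def sentence_features(sentence):
    """Map a sentence to counts of lexicon hits plus its token count."""
    tokens = TOKEN_RE.findall(sentence.lower())

    def count(lexicon):
        return sum(1 for token in tokens if token in lexicon)

    return [
        count(HOST_TERMS),            # host organism mentions
        count(PATHOGEN_TERMS),        # pathogen organism mentions
        count(INTERACTION_KEYWORDS),  # interaction keywords
        count(METHOD_KEYWORDS),       # experimental method mentions
        len(tokens),                  # a simple statistical feature
    ]


if __name__ == "__main__":
    example = "HIV-1 Tat binds human cyclin T1, as shown by co-immunoprecipitation."
    print(sentence_features(example))
```

Vectors like these, paired with manually curated labels, could then be passed to any standard SVM implementation. Real protein and gene name recognition would need a dedicated dictionary or tagger rather than the toy sets used here.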