Department of Statistics, Florida State University, Tallahassee, FL, USA.
School of Information, Florida State University, Tallahassee, FL, USA.
Database (Oxford). 2019 Jan 1;2019:bay138. doi: 10.1093/database/bay138.
Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and for developing therapeutic drugs. Manually extracting such information from the biomedical literature is time-consuming and resource-intensive. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound name and a protein name, and a CPI triplet consists of a CPI pair together with an interaction word describing their relationship. We extracted a diverse set of features from sentences and used them to build multiple machine learning models. Our models use both simple features, which can be computed directly from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted from the shortest paths between the members of a CPI pair or among the members of a CPI triplet in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep-learning methods and outperformed several deep-learning-based systems in Track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods, including deep learning.
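To make the shortest-path idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a dependency parse is already available as a list of (head, dependent) token pairs; the example sentence, the hand-written edges, the function names `shortest_path` and `path_features`, and the two features computed are all hypothetical and chosen only for illustration. Tokens are used directly as node identifiers, which assumes each token occurs once in the sentence.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return the shortest token path between start and goal, or None.

    The dependency graph is treated as undirected and searched breadth-first.
    """
    adjacency = {}
    for head, dependent in edges:
        adjacency.setdefault(head, []).append(dependent)
        adjacency.setdefault(dependent, []).append(head)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def path_features(edges, chemical, protein, interaction_words):
    """Compute two illustrative (hypothetical) features from the shortest path."""
    path = shortest_path(edges, chemical, protein)
    if path is None:
        return {"path_length": -1, "interaction_on_path": False}
    return {
        "path_length": len(path) - 1,  # number of dependency edges
        "interaction_on_path": any(
            token in interaction_words for token in path[1:-1]
        ),
    }


# Hand-written (hypothetical) parse of:
# "Aspirin strongly inhibits COX-1 activity in platelets"
edges = [
    ("inhibits", "Aspirin"),
    ("inhibits", "strongly"),
    ("inhibits", "activity"),
    ("activity", "COX-1"),
    ("activity", "platelets"),
    ("platelets", "in"),
]

features = path_features(edges, "Aspirin", "COX-1", {"inhibits"})
print(features)
```

In this toy parse the path between the chemical and the protein runs through the interaction word, which is the kind of signal a path-based feature could expose to a classifier. In practice the edges would come from a dependency parser rather than being written by hand.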