Genome Informatics, Institute of Human Genetics, Faculty of Medicine, University of Duisburg-Essen, Essen, Germany.
Integr Biol (Camb). 2012 Jul;4(7):805-12. doi: 10.1039/c2ib00126h. Epub 2012 Jun 15.
Biochemical research has yielded extensive information about dependencies between protein interactions, which arise from allosteric regulation, steric hindrance and other mechanisms. Collectively, this information is valuable for understanding large intracellular protein networks. However, it is sparsely distributed among millions of publications and documented as free-form text meant for manual reading. Here we develop a computational approach for extracting information about interaction dependencies from large numbers of publications. First, keyword-based tokenization reduces full papers to short strings, facilitating an efficient search for patterns that are likely to indicate descriptions of interaction dependencies. Sentences that match such patterns are then extracted, thereby reducing the amount of text to be read by human curators. Applied to the integrin adhesome network, this approach processed 59,933 papers and extracted 208 short statements, close to half of which indeed describe interaction dependencies. We visualize the resulting hypernetwork of dependencies and illustrate that these dependencies constrain the feasible mechanisms of adhesion site assembly and generate testable hypotheses about their switchability.
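The tokenize-then-match idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the keyword lists, the token alphabet or the search patterns, so everything named here is an assumption made for illustration. That includes the placeholder protein names `ProtA`, `ProtB` and `ProtC`, the three keyword sets, the one-letter codes `P`/`I`/`D`, and the pattern itself. The example sentences are invented and make no biological claim.

```python
import re

# Hypothetical keyword lists (assumptions; the abstract does not specify them).
PROTEINS = {"prota", "protb", "protc"}                        # placeholder names
INTERACTION_WORDS = {"binds", "binding", "interaction", "interacts"}
DEPENDENCY_WORDS = {"prevents", "inhibits", "enhances", "requires"}

# Assumed pattern over the reduced string: at least one dependency keyword (D),
# at least one interaction keyword (I), and at least three protein mentions (P).
DEPENDENCY_PATTERN = re.compile(r"^(?=.*D)(?=.*I)(?:[^P]*P){3}")


def encode(sentence: str) -> str:
    """Reduce a sentence to a short string with one letter per keyword hit."""
    codes = []
    for word in re.findall(r"[A-Za-z0-9]+", sentence.lower()):
        if word in PROTEINS:
            codes.append("P")
        elif word in INTERACTION_WORDS:
            codes.append("I")
        elif word in DEPENDENCY_WORDS:
            codes.append("D")
        # All other words are dropped, which keeps the string short.
    return "".join(codes)


def extract_candidates(text: str) -> list[str]:
    """Return the sentences whose reduced string matches the assumed pattern."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if DEPENDENCY_PATTERN.search(encode(s))]


if __name__ == "__main__":
    sample = (
        "Cells were cultured overnight. "
        "ProtA binds ProtB at the membrane. "
        "Binding of ProtA to ProtB prevents the interaction of ProtB with ProtC."
    )
    for hit in extract_candidates(sample):
        print(hit)
```

In this sketch only the third sentence is kept for a curator to read, because it is the only one that combines a dependency keyword with an interaction keyword and enough protein mentions. Matching against the reduced string rather than the raw text is what keeps the search cheap over large numbers of papers; a real pipeline would need curated vocabularies and a richer set of patterns.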