Department of Computer Science, Tufts University, Medford, MA 02155, United States.
Google Research, Cambridge, MA 02142, United States.
Bioinformatics. 2023 Aug 1;39(8). doi: 10.1093/bioinformatics/btad456.
Accurately predicting the likelihood of interaction between two objects (compound-protein sequence, user-item, author-paper, etc.) is a fundamental problem in Computer Science. Current deep-learning models rely on learning accurate representations of the interacting objects. Importantly, relationships between the interacting objects, or features of the interaction, offer an opportunity to partition the data to create multiple views of the interacting objects. The resulting congruent and non-congruent views can then be exploited via contrastive learning techniques to learn enhanced representations of the objects.
We present a novel method, Contrastive Stratification for Interaction Prediction (CSI), to stratify (partition) a dataset in a manner that can be exploited via Contrastive Multiview Coding to learn embeddings that maximize the mutual information across congruent data views. CSI assigns a key and multiple views to each data point, where data partitions under a particular key form congruent views of the data. We showcase the effectiveness of CSI by applying it to the compound-protein sequence interaction prediction problem, a pressing problem whose solution promises to expedite drug delivery (drug-protein interaction prediction), metabolic engineering, and synthetic biology (compound-enzyme interaction prediction) applications. We compare CSI with a baseline model that does not utilize data stratification and contrastive learning, and show gains in average precision ranging from 13.7% to 39% using compounds and sequences as keys across multiple drug-target and enzymatic datasets, and gains ranging from 16.9% to 63% using reaction features as keys across enzymatic datasets.
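The key-and-views idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes interactions are given as (compound, protein) pairs, uses the compound as the key, and treats proteins that share a key as congruent views. The helper names (`stratify`, `congruent_pairs`, `info_nce`), the toy identifiers, the cosine similarity, and the temperature value are all hypothetical choices for illustration, and the loss is a generic InfoNCE-style objective standing in for whatever contrastive loss CSI actually uses.

```python
import math
from collections import defaultdict
from itertools import combinations


def stratify(interactions, key_index=0):
    """Partition interaction pairs by key; the other member of each pair is a view."""
    view_index = 1 - key_index
    partitions = defaultdict(list)
    for pair in interactions:
        partitions[pair[key_index]].append(pair[view_index])
    return dict(partitions)


def congruent_pairs(partitions):
    """Yield (key, view_a, view_b) for every two views that share a key."""
    for key, views in partitions.items():
        for view_a, view_b in combinations(views, 2):
            yield key, view_a, view_b


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one congruent view, and non-congruent views."""
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(y * y for y in v))
        return dot / (norm_u * norm_v)

    # The congruent (positive) view is placed at index 0 of the logits.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp for the softmax denominator.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy (compound, protein) interactions; all identifiers are made up.
interactions = [
    ("aspirin", "P1"), ("aspirin", "P2"),
    ("caffeine", "P3"), ("caffeine", "P1"), ("caffeine", "P4"),
]
partitions = stratify(interactions, key_index=0)  # compounds act as keys
pairs = list(congruent_pairs(partitions))

# The loss is small when the congruent view is closer to the anchor than the negatives.
aligned_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Passing `key_index=1` would instead use protein sequences as keys and compounds as views, which mirrors the abstract's statement that both compounds and sequences can serve as keys. Using reaction features as keys would require interaction records that carry such a feature, which this toy format does not include.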
Code and dataset available at https://github.com/HassounLab/CSI.