Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA.
J Bioinform Comput Biol. 2021 Feb;19(1):2140002. doi: 10.1142/S0219720021400023. Epub 2021 Feb 10.
Many regions of the protein universe remain inaccessible by wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions relates to the ability to discriminate relevant structure(s) among many structures/decoys computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular. However, it is unclear how exactly to exploit the groups of decoys revealed via clustering to select individual structures for prediction. In this paper, we provide an intuitive formulation of the decoy selection problem as an instance of unsupervised multi-instance learning. We address the problem in three stages, first organizing given decoys of a protein molecule into bags, then identifying relevant bags, and finally drawing individual instances from these bags to offer as prediction. We propose both non-parametric and parametric algorithms for drawing individual instances. Our evaluation utilizes two datasets, one benchmark dataset of ensembles of decoys for a varied list of protein molecules, and a dataset of decoy ensembles for targets drawn from recent CASP competitions. A comparative analysis with state-of-the-art methods reveals that the proposed approach outperforms existing methods, thus warranting further investigation of multi-instance learning to advance our treatment of decoy selection.
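The three stages described above (organize decoys into bags, identify relevant bags, draw one instance per bag) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the clustering method, the bag-relevance criterion, or the drawing rules, so every choice below is an assumption made for illustration. Decoys are represented as plain coordinate tuples, Euclidean distance stands in for a structural metric such as RMSD, bags are formed by simple leader clustering, bag size serves as a proxy for relevance, and the instance drawn from each bag is its medoid (one possible non-parametric choice). All function names and the `radius` and `top_k` parameters are hypothetical.

```python
import math
from typing import List, Sequence

# Hypothetical stand-in: a decoy is a flat vector of coordinates or features.
Decoy = Sequence[float]


def distance(a: Decoy, b: Decoy) -> float:
    """Euclidean distance; a placeholder for a structural metric such as RMSD."""
    return math.dist(a, b)


def make_bags(decoys: List[Decoy], radius: float) -> List[List[int]]:
    """Stage 1 (assumed): leader clustering of decoy indices into bags.

    A decoy joins the first bag whose leader (first member) lies within
    `radius`; otherwise it starts a new bag.
    """
    bags: List[List[int]] = []
    for i, decoy in enumerate(decoys):
        for bag in bags:
            if distance(decoy, decoys[bag[0]]) <= radius:
                bag.append(i)
                break
        else:
            bags.append([i])
    return bags


def select_bags(bags: List[List[int]], top_k: int) -> List[List[int]]:
    """Stage 2 (assumed): treat the `top_k` largest bags as the relevant ones."""
    return sorted(bags, key=len, reverse=True)[:top_k]


def draw_instance(bag: List[int], decoys: List[Decoy]) -> int:
    """Stage 3 (assumed): return the medoid, i.e. the member with the
    smallest total distance to all other members of its bag."""
    return min(bag, key=lambda i: sum(distance(decoys[i], decoys[j]) for j in bag))


def select_decoys(decoys: List[Decoy], radius: float, top_k: int) -> List[int]:
    """Run the three stages and return one decoy index per selected bag."""
    bags = make_bags(decoys, radius)
    return [draw_instance(bag, decoys) for bag in select_bags(bags, top_k)]
```

Calling `select_decoys(decoys, radius, top_k)` on a list of coordinate tuples returns the indices of the chosen decoys, one per selected bag. A realistic pipeline would replace `distance` with a superposition-based structural metric and would likely need a more careful relevance criterion than bag size alone.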