Rogers David M
National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA.
Entropy (Basel). 2020 Oct 31;22(11):1242. doi: 10.3390/e22111242.
Automated identification of protein conformational states from simulation of an ensemble of structures is a hard problem because it requires teaching a computer to recognize shapes. We adapt the naïve Bayes classifier from the machine learning community for use on atom-to-atom pairwise contacts. The result is an unsupervised learning algorithm that samples a 'distribution' over potential classification schemes. We apply the classifier to a series of test structures and one real protein, showing that it identifies the conformational transition with >95% accuracy in most cases. A nontrivial feature of our adaptation is a new connection to information entropy that allows us to vary the level of structural detail without spoiling the categorization. This is confirmed by comparing results as the numbers of atoms and time samples are varied over 1.5 orders of magnitude. Further, the method's derivation from Bayesian analysis on the set of inter-atomic contacts makes it easy to understand and extend to more complex cases.
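To make the idea of a naïve Bayes classifier on pairwise contacts concrete, the following is a minimal sketch and not the paper's algorithm. It assumes binary contact features from a plain distance cutoff, a two-class Bernoulli mixture with uniform Beta and Dirichlet priors, and a simple Gibbs sampler to draw label assignments. The function names (toy_trajectory, contact_features, gibbs_bernoulli_mixture), the cutoff of 1.5, and the toy geometry are all hypothetical choices made for illustration. The entropy-based treatment of structural detail described in the abstract is not reproduced here.

```python
import numpy as np


def toy_trajectory(n_per_state=50, noise=0.05, seed=1):
    """Hypothetical test data: two 3-atom clusters, either close or far apart."""
    rng = np.random.default_rng(seed)
    tri = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.0, 0.3, 0.0]])
    closed = np.vstack([tri, tri + [0.6, 0.0, 0.0]])   # clusters in contact
    opened = np.vstack([tri, tri + [5.0, 0.0, 0.0]])   # clusters separated
    frames = np.concatenate([np.repeat(closed[None], n_per_state, axis=0),
                             np.repeat(opened[None], n_per_state, axis=0)])
    truth = np.repeat([0, 1], n_per_state)
    return frames + noise * rng.standard_normal(frames.shape), truth


def contact_features(coords, cutoff=1.5):
    """Map (n_frames, n_atoms, 3) coordinates to 0/1 contacts per atom pair."""
    i, j = np.triu_indices(coords.shape[1], k=1)
    dist = np.linalg.norm(coords[:, i, :] - coords[:, j, :], axis=-1)
    return (dist < cutoff).astype(np.int64)


def gibbs_bernoulli_mixture(x, n_classes=2, n_sweeps=200, seed=0):
    """Draw label assignments for an unsupervised naive Bayes (Bernoulli mixture).

    Assumed priors: Beta(1, 1) on each contact probability and
    Dirichlet(1, ..., 1) on the class weights.
    Returns an (n_sweeps, n_frames) array with one labeling per sweep.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    z = rng.integers(n_classes, size=n)          # random initial labels
    samples = []
    for _ in range(n_sweeps):
        onehot = np.eye(n_classes)[z]            # (n_frames, n_classes)
        counts = onehot.sum(axis=0)              # frames per class
        on = onehot.T @ x                        # contacts "on" per class and pair
        # Sample parameters given the current labels.
        theta = rng.beta(1.0 + on, 1.0 + counts[:, None] - on)
        theta = np.clip(theta, 1e-12, 1.0 - 1e-12)
        pi = rng.dirichlet(1.0 + counts)
        # Sample labels given the parameters; contacts are treated as
        # independent within a class (the naive Bayes assumption).
        logp = np.log(pi) + x @ np.log(theta).T + (1 - x) @ np.log1p(-theta).T
        logp -= logp.max(axis=1, keepdims=True)
        p = np.exp(logp)
        p /= p.sum(axis=1, keepdims=True)
        u = rng.random(n)
        z = np.minimum((p.cumsum(axis=1) < u[:, None]).sum(axis=1), n_classes - 1)
        samples.append(z.copy())
    return np.array(samples)


if __name__ == "__main__":
    coords, truth = toy_trajectory()
    labels = gibbs_bernoulli_mixture(contact_features(coords))[-1]
    # Class indices are arbitrary, so score both label permutations.
    accuracy = max(np.mean(labels == truth), np.mean(labels != truth))
    print(f"agreement with the known toy states: {accuracy:.2f}")
```

Because the sampler returns one labeling per sweep rather than a single answer, the spread across sweeps gives a rough picture of a distribution over classification schemes in the sense the abstract describes. The class indices carry no meaning of their own, so any comparison against known states has to allow for label permutation.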