Dryden Ian L, Hirst Jonathan D, Melville James L
School of Mathematical Sciences, University of Nottingham, University Park, Nottingham NG7 2RD, UK.
Biometrics. 2007 Mar;63(1):237-51. doi: 10.1111/j.1541-0420.2006.00622.x.
We consider Bayesian methodology for comparing two or more unlabeled point sets. Application of the technique to a set of steroid molecules illustrates its potential utility for comparing molecules in chemoinformatics and bioinformatics. We initially match a pair of molecules, where one molecule is regarded as random and the other as fixed. A type of mixture model is proposed for the point set coordinates, and the parameters of the distribution are a labeling matrix (indicating which pairs of points match) and a concentration parameter. An important property of the likelihood is that it is invariant under rotations and translations of the data. Bayesian inference for the parameters is carried out using Markov chain Monte Carlo simulation, and it is demonstrated that the procedure works well on the steroid data. The posterior distribution is difficult to simulate from because of multiple local modes, so we also use additional data (partial charges on atoms) to help with this task. An approximation is considered for speeding up the simulation algorithm, and for our data the fast approximate algorithm leads to essentially identical inference to that under the exact method. Extensions to multiple molecule alignment are also introduced, and an algorithm is described that also works well on the steroid data set. After all the steroid molecules have been matched, exploratory data analysis is carried out to examine which molecules are similar. Further Bayesian inference for the multiple alignment problem is also considered.
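The role of the labeling and of invariance under rotations and translations can be illustrated with a toy example. The sketch below is not the mixture-model likelihood of the paper: it substitutes a simple discrepancy built from within-set pairwise distances, which is unchanged by rigid motions of either point set. The function names (`rigid_move`, `discrepancy`), the coordinates, and the representation of a labeling as a list of index pairs are all assumptions made for illustration, and the rotation is restricted to the z-axis for brevity.

```python
import math
from itertools import combinations

def rigid_move(points, angle, shift):
    """Rotate 3D points about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]

def discrepancy(fixed, moving, matching):
    """Sum of squared differences between matched within-set distances.

    `matching` is a list of (i, j) pairs meaning that point i of `fixed`
    is paired with point j of `moving` (a stand-in for a labeling matrix).
    Only distances inside each set are used, so the value does not change
    when either set is rotated or translated.
    """
    total = 0.0
    for (i1, j1), (i2, j2) in combinations(matching, 2):
        d_fixed = math.dist(fixed[i1], fixed[i2])
        d_moving = math.dist(moving[j1], moving[j2])
        total += (d_fixed - d_moving) ** 2
    return total

# Hypothetical "fixed" point set (made-up coordinates, not steroid data).
fixed = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0), (0.0, 1.2, 0.9)]

# Second set: the same shape with shuffled labels, then rigidly moved.
permutation = [2, 0, 3, 1]  # moving[k] is a moved copy of fixed[permutation[k]]
moving = rigid_move([fixed[p] for p in permutation], angle=0.7, shift=(3.0, -1.0, 2.0))

true_matching = [(permutation[k], k) for k in range(len(moving))]
wrong_matching = [(k, k) for k in range(len(moving))]

d_true = discrepancy(fixed, moving, true_matching)
d_wrong = discrepancy(fixed, moving, wrong_matching)

# Moving the second set again leaves the discrepancy of a given matching unchanged.
moved_again = rigid_move(moving, angle=1.9, shift=(-4.0, 0.5, 0.0))
d_wrong_moved = discrepancy(fixed, moved_again, wrong_matching)

print(d_true, d_wrong, d_wrong_moved)
```

In this toy setting the correct labeling gives a discrepancy of essentially zero, an incorrect labeling gives a clearly positive value, and neither value depends on how the second set is positioned. A sampler over labelings, as described in the abstract, would need a proper probability model in place of this ad hoc score.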