Bae Kyounghwa, Mallick Bani K, Elsik Christine G
Department of Statistics, Texas A&M University, College Station, TX 77843-3143, USA.
Bioinformatics. 2005 May 15;21(10):2264-70. doi: 10.1093/bioinformatics/bti363. Epub 2005 Mar 3.
Our aim was to predict protein interdomain linker regions from sequence alone, without requiring known homology. Identifying linker regions delineates domain boundaries and can be used to computationally dissect proteins into domains before clustering them into families. We developed a hidden Markov model of linker/non-linker sequence regions using a linker index derived from amino acid propensity. We estimated the model efficiently in a Bayesian framework using Markov chain Monte Carlo, specifically Gibbs sampling, to draw parameters from their posterior distributions. Our model treats the sequence data as continuous rather than categorical and generates a probabilistic output.
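As a rough illustration of the kind of model described above, the following minimal sketch computes per-residue posterior probabilities of a "linker" state in a two-state hidden Markov model with continuous (Gaussian) emissions. It is not the authors' implementation. The Gaussian emission form, the function names, the observation values, and every parameter value are assumptions made for illustration only, including the choice that the linker state has the higher mean. The parameters are fixed by hand here, whereas the abstract describes estimating them by Gibbs sampling.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution, used as the continuous emission model."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_linker_prob(obs, start, trans, means, sds):
    """Scaled forward-backward pass; returns P(state = 1 | all obs) per position.

    State 0 is "non-linker" and state 1 is "linker" (a labeling assumed here).
    """
    n, k = len(obs), len(start)
    emis = [[gaussian_pdf(x, means[s], sds[s]) for s in range(k)] for x in obs]

    # Forward pass, normalizing each row to avoid numerical underflow.
    alpha = []
    for t in range(n):
        if t == 0:
            row = [start[s] * emis[0][s] for s in range(k)]
        else:
            prev = alpha[-1]
            row = [
                emis[t][s] * sum(prev[r] * trans[r][s] for r in range(k))
                for s in range(k)
            ]
        total = sum(row)
        alpha.append([v / total for v in row])

    # Backward pass, also normalized row by row.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        row = [
            sum(trans[s][r] * emis[t + 1][r] * beta[t + 1][r] for r in range(k))
            for s in range(k)
        ]
        total = sum(row)
        beta[t] = [v / total for v in row]

    # Combine both passes and renormalize at each position.
    post = []
    for t in range(n):
        w = [alpha[t][s] * beta[t][s] for s in range(k)]
        post.append(w[1] / sum(w))
    return post


# Hypothetical smoothed linker-index values along a short sequence.
obs = [0.1, 0.0, -0.1, 0.05, 0.9, 1.1, 1.0, 0.95, 0.1, -0.05, 0.0]

# Hand-picked, purely illustrative parameters (not estimates from the paper).
start = [0.9, 0.1]
trans = [[0.9, 0.1], [0.1, 0.9]]
means = [0.0, 1.0]
sds = [0.3, 0.3]

post = posterior_linker_prob(obs, start, trans, means, sds)
for position, p in enumerate(post):
    print(position, round(p, 3))
```

The output is one probability per residue rather than a hard label, which corresponds to the probabilistic output mentioned in the abstract. In a Bayesian treatment, a computation like this would be repeated across parameter draws from the sampler rather than run once with fixed values.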
We applied our method to a dataset of protein sequences in which domains and interdomain linkers had been delineated using the Pfam-A database. The predictions are superior to those of a simpler method that also uses the linker index.