Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY, USA.
Bioinformatics. 2019 Jan 1;35(1):12-19. doi: 10.1093/bioinformatics/bty523.
The analysis of sequence conservation patterns has been widely used for over half a century to identify functionally important (catalytic and ligand-binding) protein residues. Despite decades of development, state-of-the-art non-template-based functional residue prediction methods must, on average, predict ∼25% of a protein's total residues to correctly identify half of the protein's functional site residues. The resulting overwhelming proportion of false positives yields reported 'F-Scores' of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs).
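The relationship between these figures can be checked with a short calculation. The sketch below assumes the usual definition of the F-Score as the harmonic mean of precision and recall, and it assumes, purely for illustration, that about 10% of a protein's residues are functional; that fraction is not stated in the abstract, and the function names are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_from_fractions(recall, predicted_fraction, functional_fraction):
    """Precision implied by recall and the fractions of residues predicted and functional.

    For a protein with N residues:
      true positives = recall * functional_fraction * N
      predictions    = predicted_fraction * N
    so N cancels in the ratio.
    """
    return recall * functional_fraction / predicted_fraction


# Figures from the abstract: half of the functional residues are recovered
# when about 25% of all residues are predicted.
recall = 0.5
predicted_fraction = 0.25

# ASSUMPTION (not from the abstract): roughly 10% of residues are functional.
functional_fraction = 0.10

precision = precision_from_fractions(recall, predicted_fraction, functional_fraction)
score = f_score(precision, recall)
print(f"precision = {precision:.2f}, F-Score = {score:.2f}")
```

Under that assumed functional fraction, four out of five predicted residues are false positives, and the resulting F-Score is close to the ∼0.3 reported above.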
The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance given the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis.
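The idea of searching over homolog subsets can be illustrated with a toy example. This is a minimal sketch and not the authors' algorithm: it assumes pre-aligned sequences of equal length, scores conservation by per-column Shannon entropy, predicts the `k` most conserved columns, and exhaustively enumerates fixed-size homolog subsets, which is only feasible for very small pools. Because it selects the subset using the known binding site, it estimates an upper bound rather than acting as a predictor. All function names, parameters, and sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0 means fully conserved."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def predict_conserved(msa, k):
    """Return the indices of the k lowest-entropy columns (ties broken by index)."""
    length = len(msa[0])
    entropies = [column_entropy([seq[i] for seq in msa]) for i in range(length)]
    ranked = sorted(range(length), key=lambda i: (entropies[i], i))
    return set(ranked[:k])


def f_score(predicted, true_sites):
    """F-Score of a predicted residue set against the known functional residues."""
    tp = len(predicted & true_sites)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(true_sites)
    return 2 * precision * recall / (precision + recall)


def best_subset_msa(query, homologs, true_sites, k, subset_size):
    """Try every homolog subset of the given size and keep the best-scoring one.

    ASSUMPTION: selection uses the known binding site (an oracle), so the
    result is an upper bound on what conservation analysis could achieve.
    """
    best_f, best_subset = -1.0, None
    for subset in combinations(homologs, subset_size):
        msa = [query, *subset]
        f = f_score(predict_conserved(msa, k), true_sites)
        if f > best_f:
            best_f, best_subset = f, subset
    return best_f, best_subset


# Toy data (hypothetical): columns 0 and 2 of the query are the binding site.
query = "ACDE"
homologs = ["AADA", "CCCE", "AGDK"]
true_sites = {0, 2}

full_f = f_score(predict_conserved([query, *homologs], k=2), true_sites)
best_f, best_subset = best_subset_msa(query, homologs, true_sites, k=2, subset_size=2)
print(f"all homologs: F = {full_f:.2f}; best subset {best_subset}: F = {best_f:.2f}")
```

In this toy case, dropping the one homolog that conserves the wrong columns is enough to recover both binding-site positions, which mirrors the abstract's point that the choice of homologs, rather than the conservation measure itself, can dominate prediction quality.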
Supplementary data are available at Bioinformatics online.