Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, La Plaine Campus, Triomflaan, Belgium.
Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, Belgium.
Sci Rep. 2016 Nov 18;6:36679. doi: 10.1038/srep36679.
Next Generation Sequencing is dramatically increasing the number of known protein sequences, while the number of experimentally determined protein structures lags behind. Structural bioinformatics attempts to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, and most of these methods rely heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but precisely those proteins have significantly more homologs available than less-studied sequences randomly extracted from UniProt. Structural bioinformatics methods developed this way are thus likely to have over-estimated performance; we demonstrate this for two contact prediction methods, whose performance drops by up to 60% when a more realistic amount of evolutionary information is taken into account. We provide a bias-free dataset for the validation of contact prediction methods, called NOUMENON.
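The selection bias described in the abstract is, at its core, a comparison of homolog counts between two groups of proteins: those with structures in the PDB and sequences drawn at random from UniProt. The following is a minimal sketch of how such a comparison could be tested, assuming the homolog counts have already been obtained by some homology search. The function name `median_gap_permutation_test`, the choice of a permutation test on the difference of medians, and the example counts are all illustrative assumptions; they are not the procedure or the data used in the paper.

```python
import random
from statistics import median


def median_gap_permutation_test(pdb_counts, uniprot_counts, n_perm=10000, seed=0):
    """One-sided permutation test on median(pdb_counts) - median(uniprot_counts).

    Returns (observed_gap, p_value). The p-value estimates how often a random
    relabelling of the pooled counts yields a gap at least as large as observed.
    """
    rng = random.Random(seed)
    observed = median(pdb_counts) - median(uniprot_counts)
    pooled = list(pdb_counts) + list(uniprot_counts)
    n_pdb = len(pdb_counts)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        gap = median(pooled[:n_pdb]) - median(pooled[n_pdb:])
        if gap >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical homolog counts, invented purely to exercise the function.
example_pdb_counts = [1200, 950, 3000, 780, 1500, 2200]
example_uniprot_counts = [40, 15, 300, 90, 5, 120]

gap, p_value = median_gap_permutation_test(example_pdb_counts, example_uniprot_counts)
```

A small p-value here would only indicate that the two groups differ in typical homolog count; it says nothing by itself about how much a given predictor's performance depends on that count.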