Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA.
Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA.
Bioinformatics. 2019 May 15;35(10):1737-1744. doi: 10.1093/bioinformatics/bty834.
Due to the nature of experimental annotation, most protein function prediction methods operate at the protein level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, so regions that are not assigned to a known domain family are left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features.
We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that it improves both the accuracy and region-specificity of protein function transfer and prediction. We compare the region-level predictive performance of our method against that of a whole-protein baseline method using proteins with structurally verified binding sites. We also compare protein-level predictive performance on a temporal holdout, which expands the variety and specificity of GO terms we could evaluate. Our results can also serve as a starting point for categorizing GO terms into region-specific and whole-protein terms and for selecting prediction methods for different classes of GO terms.
The code and features are freely available at: https://github.com/ek1203/rsfp.
Supplementary data are available at Bioinformatics online.