Cao Mengfei, Cowen Lenore J
Department of Computer Science, Tufts University, Medford, MA 02155, USA.
Pac Symp Biocomput. 2017;22:15-26. doi: 10.1142/9789813207813_0003.
Current automated computational methods for assigning functional labels to unstudied genes often transfer annotation from orthologous or paralogous genes. However, such genes can evolve divergent functions, making the transfer inappropriate. We consider the problem of determining when it is correct to make such an assignment between paralogs. We construct a benchmark dataset of two types of similar paralogous gene pairs in the well-studied model organism S. cerevisiae: one set of pairs whose single deletion mutants have very similar phenotypes (implying similar functions), and another set of pairs whose single deletion mutants have very divergent phenotypes (implying different functions). State-of-the-art methods for this problem determine the evolutionary history of the paralogs with reference to multiple related species. Here, we ask a first and simpler question: to what extent can any computational method with access only to data from a single species solve this problem? We consider divergence data (at both the amino acid and nucleotide levels) and network data (based on the yeast protein-protein interaction (PPI) network, as captured in BioGRID), and ask whether we can extract features from these data that distinguish between these sets of paralogous gene pairs. We find that the best features come from measures of sequence divergence; however, simple network measures based on degree, centrality, shortest path, diffusion state distance (DSD), or shared neighborhood in the yeast PPI network also contain some signal. In general, one should not transfer function if sequence divergence is too high. Further improvements in classification will need to come from more computationally expensive but much more powerful evolutionary methods that incorporate ancestral states and measure evolutionary divergence over multiple species based on evolutionary trees.
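To make the single-species feature families named above more concrete, the following is a minimal Python sketch of how a few of them might be computed for one paralogous pair. It is an illustration under stated assumptions, not the authors' feature definitions: the shared-neighborhood score is assumed here to be a Jaccard index, sequence divergence is assumed to be a simple p-distance over ungapped columns of a pre-computed alignment, and centrality and DSD are omitted. All function names and the toy gene labels are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) interaction pairs."""
    graph = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def shortest_path_length(graph, source, target):
    """Hop count between two nodes via breadth-first search, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def shared_neighborhood(graph, a, b):
    """Jaccard index of the two neighbor sets (assumed definition), excluding a and b."""
    neighbors_a = graph.get(a, set()) - {b}
    neighbors_b = graph.get(b, set()) - {a}
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0


def p_distance(aligned_a, aligned_b, gap="-"):
    """Fraction of differing positions among ungapped columns of two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in columns) / len(columns)


def pair_features(graph, a, b, aligned_a, aligned_b):
    """Collect the illustrative single-species features for one gene pair."""
    return {
        "degree_a": len(graph.get(a, ())),
        "degree_b": len(graph.get(b, ())),
        "shortest_path": shortest_path_length(graph, a, b),
        "shared_neighborhood": shared_neighborhood(graph, a, b),
        "p_distance": p_distance(aligned_a, aligned_b),
    }


# Toy network with hypothetical gene labels, not real BioGRID data.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("D", "E")]
toy_graph = build_graph(toy_edges)
features = pair_features(toy_graph, "A", "B", "MKT-LV", "MRTALV")
```

A feature dictionary like this, computed for every pair in a benchmark, could then be fed to any standard classifier; the choice of classifier and of any divergence threshold is left open here.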