Wallace Tim, Sekmen Ali, Wang Xiaofei
1 Department of Computer Science, Tennessee State University, Nashville, Tennessee.
2 Department of Biological Sciences, Tennessee State University, Nashville, Tennessee.
J Comput Biol. 2015 Oct;22(10):940-52. doi: 10.1089/cmb.2015.0084. Epub 2015 Jul 10.
Identification and clustering of orthologous genes play an important role in developing evolutionary models, such as validating convergent and divergent phylogeny and predicting functional proteins in newly sequenced species with unverified nucleotide-protein mappings. Here, we introduce an application of subspace clustering to orthologous gene sequences and discuss the initial results. The working hypothesis is that genetic changes between protein-coding nucleotide sequences among selected species and groups may lie within a union of subspaces, one for each cluster of orthologous groups. Estimates of the subspace dimensions were computed for a small population sample. A series of experiments was performed to cluster randomly selected sequences. The experimental design allows for both false positives and false negatives, and estimates of statistical significance are provided. The clustering results are consistent with the main hypothesis. A simple random-mutation binary tree model is used to simulate speciation events and shows how the subspace rank depends on time and mutation rate. The simple mutation model is found to be largely consistent with the singular values observed in the subspace clustering results. Our study indicates that the subspace clustering method may be applied in orthology analysis.
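The relationship between a random-mutation binary tree and subspace rank can be illustrated with a minimal sketch. This is not the authors' implementation: the one-hot encoding, the mean-centering step, the relative tolerance used to count singular values, and every function name below (`one_hot`, `mutate`, `simulate_tree`, `singular_values_and_rank`) are assumptions chosen for illustration only.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a nucleotide string as a flat indicator vector of length 4 * len(seq)."""
    idx = np.array([BASES.index(b) for b in seq])
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), idx] = 1.0
    return out.ravel()


def mutate(seq, rate, rng):
    """Substitute each base independently, with probability `rate`, by a different base."""
    out = []
    for b in seq:
        if rng.random() < rate:
            choices = [c for c in BASES if c != b]
            out.append(choices[rng.integers(3)])
        else:
            out.append(b)
    return "".join(out)


def simulate_tree(root, depth, rate, rng):
    """Split every sequence into two independently mutated children, `depth` times.

    Returns the 2**depth leaf sequences (the simulated present-day species).
    """
    level = [root]
    for _ in range(depth):
        level = [mutate(s, rate, rng) for s in level for _ in range(2)]
    return level


def singular_values_and_rank(leaves, tol=1e-8):
    """Stack encoded leaves as columns, center them, and count non-negligible singular values."""
    X = np.stack([one_hot(s) for s in leaves], axis=1)
    X = X - X.mean(axis=1, keepdims=True)  # remove the shared ancestral component
    s = np.linalg.svd(X, compute_uv=False)
    rank = int(np.sum(s > tol * max(s[0], 1.0)))
    return s, rank


rng = np.random.default_rng(0)
root = "".join(rng.choice(list(BASES), size=200))

# Hypothetical parameters: 3 speciation rounds, 5% substitution rate per branch.
leaves = simulate_tree(root, depth=3, rate=0.05, rng=rng)
s, rank = singular_values_and_rank(leaves)
print("leaves:", len(leaves), "numerical rank:", rank)
print("leading singular values:", np.round(s[:4], 3))
```

With no mutation, all leaves are identical and the centered matrix has rank zero. As the mutation rate or the tree depth grows, more singular values become non-negligible, up to one fewer than the number of leaves. Varying `depth` and `rate` in this toy setup gives a rough feel for how rank, time, and mutation rate interact.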