IEEE/ACM Trans Comput Biol Bioinform. 2019 Nov-Dec;16(6):1773-1784. doi: 10.1109/TCBB.2018.2840996. Epub 2018 May 30.
We present a cluster analysis of human proteins that combines 1) n-gram based amino acid frequency features, 2) optimal feature selection, 3) hierarchical clustering, and 4) advanced partitioning techniques. Our method models each protein by the frequencies of its amino acid n-grams and, both qualitatively and quantitatively, groups proteins of increasing sequence similarity into the same clusters. We experiment with n = 1 (unigrams), n = 2 (bigrams), and n = 3 (trigrams) to select the optimal features for the design of the 3gClust algorithm. Benchmarking results on 20,105 manually curated human proteins show that 3gClust achieves better cluster compactness for proteins with similar functional groups, biological processes, structural alignment, and shared domains (e.g., aquaporins, keratins). Quantitative analysis of non-singleton clusters shows a significant improvement in compactness over other state-of-the-art methodologies. 3gClust and its datasets are available for academic use at https://sites.google.com/site/bioinfoju/projects/3gclust; supplementary materials can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2018.2840996.
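To make the n-gram frequency features concrete, the following is a minimal sketch and not the 3gClust implementation. It assumes overlapping n-grams, relative frequencies, Euclidean distance, and a naive single-linkage merge with a fixed cutoff; none of these choices is stated in the abstract, and the feature selection and partitioning steps are not reproduced. The sequences and names (`toy_A1`, `toy_A2`, `toy_B1`) and the threshold `0.3` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_frequencies(sequence, n):
    """Relative frequency of each overlapping n-gram in a sequence."""
    total = len(sequence) - n + 1
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + n] for i in range(total))
    return {gram: count / total for gram, count in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def single_linkage(profiles, threshold):
    """Naive agglomerative clustering (assumed linkage, not the paper's).

    Repeatedly merges the two closest clusters and stops once the
    smallest inter-cluster distance exceeds `threshold`.
    """
    clusters = [[name] for name in profiles]

    def distance(a, b):
        return min(euclidean(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        if distance(clusters[i], clusters[j]) > threshold:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy sequences, not taken from the benchmark dataset.
sequences = {
    "toy_A1": "MKVLAAGIVGLLA",
    "toy_A2": "MKVLAAGIVGLLS",
    "toy_B1": "PPGPPGPPGPPGP",
}

# Trigram (n = 3) profiles, mirroring the largest n explored in the abstract.
profiles = {name: ngram_frequencies(seq, 3) for name, seq in sequences.items()}
clusters = single_linkage(profiles, threshold=0.3)
print(clusters)
```

In this toy setting the two sequences that differ in a single residue share almost all of their trigrams and merge, while the repetitive third sequence remains a singleton. Changing `n` from 1 to 3 trades a compact 20-dimensional profile for a sparser one with up to 8,000 possible features, which is the kind of trade-off that motivates a feature selection step.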