Applied Statistics Unit, Indian Statistical Institute, Kolkata 700108, India; Department of Pediatrics, School of Medicine, Johns Hopkins University, MD 21205, USA.
Department of Master of Computer Applications, MCKV Institute of Engineering, West Bengal 711204, India.
Gene. 2021 Jan 15;766:145096. doi: 10.1016/j.gene.2020.145096. Epub 2020 Sep 9.
Phylogenetic analysis of real biological taxa based on sequence similarity remains a major challenge. In this paper, we propose a novel alignment-free method, CoFASA (Codon Feature based Amino acid Sequence Analyser), for similarity analysis of nucleotide sequences. First, we assign numerical weights to the four nucleotides. We then calculate a score for each codon from the numerical values of its constituent nucleotides, termed the degree of the codon. From these, we obtain the degree of each amino acid based on the degrees of the codons that encode it. Using the degrees of the twenty amino acids and their relative abundance within a given sequence, we generate a 20-dimensional feature vector for every coding DNA sequence or protein sequence. We use these features to perform phylogenetic analysis of the set of candidate sequences. For experimental assessment, we use multiple protein sequences derived from beta-globin (BG), NADH dehydrogenase subunit 5 (ND5), transferrins (TFs), and xylanases, as well as low-identity (<40%) and high-identity (⩾40%) protein sequences (encompassing 533 and 1064 protein families). We compare our results with sixteen well-known methods, both alignment-based and alignment-free. Performance is assessed with several indices, including the Pearson correlation coefficient, the Robinson-Foulds (RF) distance, and the ROC score. CoFASA produces results very similar to those of the alignment-based methods (ClustalW, ClustalΩ, MAFFT, and MUSCLE). Further, CoFASA performs better than well-known alignment-free methods, including LZW-Kernel, jD2Stat, FFP, spaced, and AFKS-D2s, in predicting taxonomic relationships among candidate taxa. Overall, we observe that the features derived by CoFASA are very useful for separating sequences according to their taxonomic labels. Our method is cost-effective and, at the same time, produces consistent and satisfactory outcomes.
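The feature construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the nucleotide weights or the exact formulas, so the weights in `WEIGHTS`, the use of a sum for the codon degree, the use of a mean over synonymous codons for the amino acid degree, the product of degree and relative abundance as the feature, and the Euclidean distance between feature vectors are all assumptions made for illustration. Only the standard genetic code table is a known fact.

```python
from collections import Counter
from itertools import product
from math import dist

# ASSUMPTION: hypothetical nucleotide weights; the abstract does not state them.
WEIGHTS = {"A": 1, "C": 2, "G": 3, "T": 4}

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})  # the 20 amino acid letters


def codon_degree(codon):
    """ASSUMPTION: the codon degree is the sum of its nucleotide weights."""
    return sum(WEIGHTS[n] for n in codon)


def amino_acid_degrees():
    """ASSUMPTION: an amino acid's degree is the mean degree of its codons."""
    groups = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            groups.setdefault(aa, []).append(codon_degree(codon))
    return {aa: sum(v) / len(v) for aa, v in groups.items()}


AA_DEGREE = amino_acid_degrees()


def translate(cds):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def feature_vector(protein):
    """ASSUMPTION: feature = amino acid degree x relative abundance (20 values)."""
    counts = Counter(aa for aa in protein.upper() if aa in AA_DEGREE)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [AA_DEGREE[aa] * counts[aa] / total for aa in AMINO_ACIDS]


def distance_matrix(proteins):
    """ASSUMPTION: Euclidean distance between feature vectors."""
    feats = [feature_vector(p) for p in proteins]
    return [[dist(a, b) for b in feats] for a in feats]
```

A distance matrix built this way could then be passed to a standard tree-building procedure such as neighbor joining or UPGMA; the abstract does not say which procedure the authors use.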