Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis.

Author information

Applied Statistics Unit, Indian Statistical Institute, Kolkata 700108, India; Department of Pediatrics, School of Medicine, Johns Hopkins University, MD 21205, USA.

Department of Master of Computer Applications, MCKV Institute of Engineering, West Bengal 711204, India.

Publication information

Gene. 2021 Jan 15;766:145096. doi: 10.1016/j.gene.2020.145096. Epub 2020 Sep 9.

DOI:10.1016/j.gene.2020.145096

PMID:32919006

Abstract

Phylogenetic analysis of real biological taxa based on sequence similarity remains a major challenge. In this paper, we propose a novel alignment-free method, CoFASA (Codon Feature based Amino acid Sequence Analyser), for similarity analysis of nucleotide sequences. First, we assign numerical weights to the four nucleotides. We then calculate a score for each codon from the numerical values of its constituent nucleotides, termed the degree of the codon. From these, we obtain the degree of each amino acid based on the degrees of the codons that encode it. Using the degrees of the twenty amino acids and their relative abundance within a given sequence, we generate a 20-dimensional feature vector for every coding DNA sequence or protein sequence. We use these features to perform phylogenetic analysis of the set of candidate sequences. For experimental assessment, we use multiple protein sequences derived from Beta-globin (BG), NADH dehydrogenase subunit 5 (ND5), Transferrins (TFs), and Xylanases, as well as low-identity (<40%) and high-identity (⩾40%) protein sequences (encompassing 533 and 1064 protein families). We compare our results with sixteen well-known methods, both alignment-based and alignment-free. Performance is assessed using several indices, including the Pearson correlation coefficient, the Robinson-Foulds (RF) distance, and the ROC score. CoFASA yields results very similar to those of the alignment-based methods (ClustalW, ClustalΩ, MAFFT, and MUSCLE). It also outperforms well-known alignment-free methods, including LZW-Kernel, jD2Stat, FFP, spaced, and AFKS-D2s, in predicting taxonomic relationships among candidate taxa. Overall, we observe that the features derived by CoFASA are very useful for separating sequences according to their taxonomic labels. Our method is cost-effective while producing consistent and satisfactory outcomes.
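The feature construction described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the nucleotide weights or the exact formulas, so everything specific here is an assumption made for illustration only: the weights in `NUCLEOTIDE_WEIGHT`, taking a codon's degree as the sum of its nucleotide weights, taking an amino acid's degree as the mean degree of its synonymous codons, and forming each feature as degree times relative abundance. The function names are hypothetical and are not taken from CoFASA.

```python
from collections import Counter
from itertools import product
import math

# ASSUMPTION: placeholder weights; the abstract does not state the real values.
NUCLEOTIDE_WEIGHT = {"A": 1, "C": 2, "G": 3, "T": 4}

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_degree(codon):
    """ASSUMPTION: degree of a codon = sum of its nucleotide weights."""
    return sum(NUCLEOTIDE_WEIGHT[base] for base in codon)


def amino_acid_degrees():
    """ASSUMPTION: degree of an amino acid = mean degree of its codons."""
    groups = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            groups.setdefault(aa, []).append(codon_degree(codon))
    return {aa: sum(degrees) / len(degrees) for aa, degrees in groups.items()}


AA_DEGREE = amino_acid_degrees()
AA_ORDER = sorted(AA_DEGREE)  # fixed order of the 20 amino acid letters


def translate(cds):
    """Translate a coding DNA sequence up to the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def feature_vector(protein):
    """20-dimensional vector: amino acid degree times relative abundance."""
    counts = Counter(aa for aa in protein if aa in AA_DEGREE)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AA_ORDER)
    return [AA_DEGREE[aa] * counts[aa] / total for aa in AA_ORDER]


def euclidean(u, v):
    """Distance between two feature vectors (one possible choice of metric)."""
    return math.dist(u, v)
```

A pairwise distance matrix computed with `euclidean` over such vectors could then be passed to a standard tree-building method. Whether CoFASA uses Euclidean distance is not stated in the abstract, so that choice is also an assumption.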
