Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis.

Affiliations

Applied Statistics Unit, Indian Statistical Institute, Kolkata 700108, India; Department of Pediatrics, School of Medicine, Johns Hopkins University, MD 21205, USA.

Department of Master of Computer Applications, MCKV Institute of Engineering, West Bengal 711204, India.

Publication information

Gene. 2021 Jan 15;766:145096. doi: 10.1016/j.gene.2020.145096. Epub 2020 Sep 9.

DOI: 10.1016/j.gene.2020.145096
PMID: 32919006
Abstract

Phylogenetic analysis of real biological taxa based on sequence similarity remains a major challenge. In this paper, we propose a novel alignment-free method, CoFASA (Codon Feature based Amino acid Sequence Analyser), for similarity analysis of nucleotide sequences. First, we assign numerical weights to the four nucleotides. We then calculate a score for each codon from the numerical values of its constituent nucleotides, termed the degree of the codon. Accordingly, we obtain the degree of each amino acid from the degrees of the codons that encode it. Using the degrees of the twenty amino acids and their relative abundance within a given sequence, we generate a 20-dimensional feature vector for every coding DNA sequence or protein sequence. We use these features to perform phylogenetic analysis of the set of candidate sequences. For experimental assessment, we use multiple protein sequences derived from beta-globin (BG), NADH dehydrogenase subunit 5 (ND5), transferrins (TFs), and xylanases, as well as low-identity (<40%) and high-identity (≥40%) protein sequences (encompassing 533 and 1064 protein families). We compare our results with 16 well-known methods, both alignment-based and alignment-free. Several assessment indices are used for performance analysis, including the Pearson correlation coefficient, the Robinson-Foulds (RF) distance, and the ROC score. Compared with alignment-based methods (ClustalW, ClustalΩ, MAFFT, and MUSCLE), CoFASA gives very similar results. Further, CoFASA outperforms well-known alignment-free methods, including LZW-Kernel, jD2Stat, FFP, spaced, and AFKS-D2s, in predicting taxonomic relationships among candidate taxa. Overall, we observe that the features derived by CoFASA are very useful for separating sequences according to their taxonomic labels. Our method is cost-effective and, at the same time, produces consistent and satisfactory outcomes.
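
The feature construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the nucleotide weights or the exact formulas, so everything numeric here is an assumption made for illustration. The weights in NUC_WEIGHT are hypothetical. The codon degree is taken to be the sum of its nucleotide weights, the amino acid degree the mean degree of its synonymous codons, and each feature the amino acid degree multiplied by its relative abundance. The function names and the use of Euclidean distance are also hypothetical choices.

```python
from collections import Counter
from itertools import product
import math

# Hypothetical nucleotide weights; the abstract does not state the values used.
NUC_WEIGHT = {"T": 1, "C": 2, "A": 3, "G": 4}

# Standard genetic code, with codons enumerated in TCAG order ("*" marks stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The twenty amino acids in a fixed (alphabetical) order for the feature vector.
AA_ORDER = sorted(set(AMINO) - {"*"})


def codon_degree(codon):
    """Assumed codon score: the sum of the weights of its three nucleotides."""
    return sum(NUC_WEIGHT[n] for n in codon)


def amino_acid_degrees():
    """Assumed amino acid score: mean degree of the codons that encode it."""
    by_aa = {aa: [] for aa in AA_ORDER}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            by_aa[aa].append(codon_degree(codon))
    return {aa: sum(d) / len(d) for aa, d in by_aa.items()}


AA_DEGREE = amino_acid_degrees()


def translate(cds):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def feature_vector(protein):
    """20-dimensional vector: amino acid degree times relative abundance."""
    counts = Counter(protein.upper())
    total = sum(counts[aa] for aa in AA_ORDER)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [AA_DEGREE[aa] * counts[aa] / total for aa in AA_ORDER]


def feature_distance(seq_a, seq_b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.dist(feature_vector(seq_a), feature_vector(seq_b))


if __name__ == "__main__":
    a = translate("ATGGTGCACCTGACTCCTGAGTAA")  # toy coding sequence
    b = "MVHLTPVE"                            # toy protein sequence
    print(a, feature_distance(a, b))
```

A pairwise matrix of such distances over all candidate sequences could then be passed to a standard distance-based tree-building method such as neighbor joining; the abstract does not say which tree construction procedure CoFASA uses.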
