A novel alignment-free vector method to cluster protein sequences.

Author information

He Lily, Li Yongkun, He Rong Lucy, Yau Stephen S-T

Affiliations

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China.

Department of Biological Sciences, Chicago State University, Chicago, IL, USA.

Publication information

J Theor Biol. 2017 Aug 1;427:41-52. doi: 10.1016/j.jtbi.2017.06.002. Epub 2017 Jun 3.

DOI:10.1016/j.jtbi.2017.06.002

PMID:28587743

Abstract

Protein classification is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply over the past decade. Traditionally, protein sequences are compared using multiple sequence alignment methods. However, these methods may be unsuitable for clustering protein sequences when gene rearrangements occur, as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, polar requirement, and chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method not only uses the chemical properties of amino acids but also accounts for their counts and positions. The results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.
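
The abstract does not spell out how the 24 dimensions are built, so the following is only a minimal sketch of the general idea it describes: group residues by a chemical property, then record each group's count and positional statistics. It uses the Kyte-Doolittle hydropathy index alone; the three bins, their thresholds, the variance scaling, and the function names (`hydropathy_group`, `group_feature_vector`, `euclidean`) are illustrative assumptions, not the authors' construction.

```python
from math import sqrt

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# ASSUMPTION: a three-way binning chosen for illustration only.
GROUPS = ("hydrophobic", "neutral", "hydrophilic")


def hydropathy_group(residue):
    """Assign a residue to an illustrative hydropathy bin."""
    h = HYDROPATHY[residue]  # raises KeyError for non-standard residues
    if h >= 1.8:
        return "hydrophobic"
    if h <= -3.2:
        return "hydrophilic"
    return "neutral"


def group_feature_vector(seq):
    """Return [count, mean position, scaled variance] for each group.

    Positions are 1-based. The variance is divided by (count * len(seq)),
    an assumed normalization so long sequences do not dominate.
    """
    n = len(seq)
    positions = {g: [] for g in GROUPS}
    for i, residue in enumerate(seq, start=1):
        positions[hydropathy_group(residue)].append(i)

    vec = []
    for g in GROUPS:
        pos = positions[g]
        count = len(pos)
        if count == 0:
            # Empty group: all three entries are zero.
            vec.extend([0.0, 0.0, 0.0])
            continue
        mean = sum(pos) / count
        var = sum((p - mean) ** 2 for p in pos) / (count * n)
        vec.extend([float(count), mean, var])
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Vectors of this kind can be compared pairwise with `euclidean` and the resulting distance matrix passed to any standard clustering routine, with no alignment step. Because the mean position is part of the vector, two sequences with identical composition but different residue order (for example `"AAKK"` and `"KKAA"`) still get a nonzero distance.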

Similar articles

A novel fast vector method for genetic sequence comparison.

Sci Rep. 2017 Sep 22;7(1):12226. doi: 10.1038/s41598-017-12493-2.

Protein map: an alignment-free sequence comparison method based on various properties of amino acids.

Gene. 2011 Oct 15;486(1-2):110-8. doi: 10.1016/j.gene.2011.07.002. Epub 2011 Jul 19.

An alignment-free measure based on physicochemical properties of amino acids for protein sequence comparison.

Comput Biol Chem. 2019 Jun;80:10-15. doi: 10.1016/j.compbiolchem.2019.01.005. Epub 2019 Jan 18.

Classification of Protein Sequences by a Novel Alignment-Free Method on Bacterial and Virus Families.

Genes (Basel). 2022 Sep 27;13(10):1744. doi: 10.3390/genes13101744.

An alignment-free method to find similarity among protein sequences via the general form of Chou's pseudo amino acid composition.

SAR QSAR Environ Res. 2013;24(7):597-609. doi: 10.1080/1062936X.2013.773378. Epub 2013 May 28.

A new distribution vector and its application in genome clustering.

Mol Phylogenet Evol. 2011 May;59(2):438-43. doi: 10.1016/j.ympev.2011.02.020. Epub 2011 Mar 6.

Deriving the phylogenetic information from some physicochemical properties of protein sequences computed.

J Comput Chem. 2011 Jan 15;32(1):70-80. doi: 10.1002/jcc.21599.

High-quality sequence clustering guided by network topology and multiple alignment likelihood.

Bioinformatics. 2012 Apr 15;28(8):1078-85. doi: 10.1093/bioinformatics/bts098. Epub 2012 Feb 25.

A measure of DNA sequence similarity by Fourier Transform with applications on hierarchical clustering.

J Theor Biol. 2014 Oct 21;359:18-28. doi: 10.1016/j.jtbi.2014.05.043. Epub 2014 Jun 6.

Cited by

Antimicrobial activities and phylogenetic study of DC (Fabaceae) seed lectin.

BioTechnologia (Pozn). 2023 Mar 27;104(1):21-32. doi: 10.5114/bta.2023.125083. eCollection 2023.

Integration of In Silico and In Vitro Analysis of Gliotoxin Production Reveals a Narrow Range of Producing Fungal Species.

J Fungi (Basel). 2022 Mar 31;8(4):361. doi: 10.3390/jof8040361.

SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform.

BMC Bioinformatics. 2018 May 2;19(1):165. doi: 10.1186/s12859-018-2155-9.
