An accurate alignment-free protein sequence comparator based on physicochemical properties of amino acids.

Affiliation

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.

Publication information

Sci Rep. 2022 Jul 1;12(1):11158. doi: 10.1038/s41598-022-15266-8.

DOI:10.1038/s41598-022-15266-8
PMID:35778592
Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9247937/
Abstract

Bio-sequence comparators are among the most basic and significant methods for assessing biological data, and, given the importance of proteins, protein sequence comparators are particularly crucial. At the same time, the complexity of the problem, the growing number of extracted protein sequences, and the growth of studies and data analysis applications addressing protein sequences have necessitated a rapid and accurate approach. We therefore propose a protein sequence comparison approach, called PCV, which improves comparison accuracy by producing vectors that encode sequence data as well as the physicochemical properties of the amino acids. By partitioning long protein sequences into fixed-length blocks and providing an encoding vector for each block, the method also allows for a parallel and fast implementation. To evaluate the performance of PCV, as with other alignment-free methods, we used 12 benchmark datasets including classes with homologous sequences, which may require a simple preprocessing search tool to select the homologous data. We then compared the protein sequence comparison outcomes to those of alternative alignment-based and alignment-free methods, using various evaluation criteria. The results indicate that our method provides a significant improvement in sequence classification accuracy compared to the alternative alignment-free methods and has an average correlation of about 94% with ClustalW, our reference method, while considerably reducing the processing time.
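
The block-wise encoding idea in the abstract can be illustrated with a minimal Python sketch. This is not the PCV algorithm: the abstract does not specify which physicochemical properties are used, how each block vector is built, or which distance is applied. The sketch assumes a single property (the Kyte-Doolittle hydropathy scale), a block length of 4, a per-block mean as the encoding, and a zero-padded Euclidean distance. The names `split_blocks`, `encode_block`, `encode_sequence`, and `distance` are hypothetical.

```python
from itertools import zip_longest
from math import sqrt

# Kyte-Doolittle hydropathy values. Assumption: used here as one example
# property; the abstract does not say which properties PCV encodes.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def split_blocks(sequence, block_len):
    """Partition a sequence into consecutive blocks; the last may be shorter."""
    return [sequence[i:i + block_len] for i in range(0, len(sequence), block_len)]


def encode_block(block):
    """Encode one block as its mean hydropathy (illustrative choice)."""
    values = [HYDROPATHY[residue] for residue in block]
    return sum(values) / len(values)


def encode_sequence(sequence, block_len=4):
    """Encode every block independently; each call could run in parallel."""
    return [encode_block(block) for block in split_blocks(sequence, block_len)]


def distance(vec_a, vec_b):
    """Euclidean distance, padding the shorter vector with 0.0 (assumption)."""
    return sqrt(sum((a - b) ** 2
                    for a, b in zip_longest(vec_a, vec_b, fillvalue=0.0)))


if __name__ == "__main__":
    # Two toy sequences that differ by a single residue (I vs. L).
    first = encode_sequence("MKTAYIAKQR")
    second = encode_sequence("MKTAYLAKQR")
    print(distance(first, second))
```

Because each block is encoded without reference to its neighbours, the list comprehension in `encode_sequence` could be swapped for a parallel map, which is the property the abstract points to when it mentions a parallel and fast implementation.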



Similar articles

1
An accurate alignment-free protein sequence comparator based on physicochemical properties of amino acids.
Sci Rep. 2022 Jul 1;12(1):11158. doi: 10.1038/s41598-022-15266-8.
2
An Alignment-Free Algorithm in Comparing the Similarity of Protein Sequences Based on Pseudo-Markov Transition Probabilities among Amino Acids.
PLoS One. 2016 Dec 5;11(12):e0167430. doi: 10.1371/journal.pone.0167430. eCollection 2016.
3
transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.
BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156.
4
Mapping sequence to feature vector using numerical representation of codons targeted to amino acids for alignment-free sequence analysis.
Gene. 2021 Jan 15;766:145096. doi: 10.1016/j.gene.2020.145096. Epub 2020 Sep 9.
5
An alignment-free measure based on physicochemical properties of amino acids for protein sequence comparison.
Comput Biol Chem. 2019 Jun;80:10-15. doi: 10.1016/j.compbiolchem.2019.01.005. Epub 2019 Jan 18.
6
Periodic distributions of hydrophobic amino acids allows the definition of fundamental building blocks to align distantly related proteins.
Proteins. 2007 May 15;67(3):695-708. doi: 10.1002/prot.21319.
7
A Generalized Iterative Map for Analysis of Protein Sequences.
Comb Chem High Throughput Screen. 2022;25(3):381-391. doi: 10.2174/1386207323666201012142318.
8
Fast model-based protein homology detection without alignment.
Bioinformatics. 2007 Jul 15;23(14):1728-36. doi: 10.1093/bioinformatics/btm247. Epub 2007 May 8.
9
SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences.
BMC Bioinformatics. 2008 May 1;9:226. doi: 10.1186/1471-2105-9-226.
10
OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy.
BMC Bioinformatics. 2003 Oct 10;4:47. doi: 10.1186/1471-2105-4-47.
