An accurate alignment-free protein sequence comparator based on physicochemical properties of amino acids.

Author information

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.

Publication information

Sci Rep. 2022 Jul 1;12(1):11158. doi: 10.1038/s41598-022-15266-8.

DOI:10.1038/s41598-022-15266-8

PMID:35778592

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9247937/

Abstract

Bio-sequence comparators are among the most basic and significant tools for assessing biological data, and, given the importance of proteins, protein sequence comparators are particularly crucial. At the same time, the complexity of the problem, the growing number of extracted protein sequences, and the growth of studies and data analysis applications addressing protein sequences call for a rapid and accurate approach. We therefore propose a protein sequence comparison approach, called PCV, which improves comparison accuracy by producing vectors that encode sequence data as well as the physicochemical properties of the amino acids. In addition, by partitioning long protein sequences into fixed-length blocks and providing an encoding vector for each block, the method allows for a fast, parallel implementation. To evaluate the performance of PCV, as with other alignment-free methods, we used 12 benchmark datasets comprising classes of homologous sequences, which may require a simple preprocessing search tool to select the homologous data. We then compared the protein sequence comparison outcomes with those of alternative alignment-based and alignment-free methods, using various evaluation criteria. The results indicate that our method provides a significant improvement in sequence classification accuracy over the alternative alignment-free methods and has an average correlation of about 94% with ClustalW, our reference method, while considerably reducing processing time.
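The abstract does not give the PCV encoding itself, so the following is only a minimal sketch of the general idea it describes: cut each sequence into fixed-length blocks, build one vector per block from composition plus a physicochemical property, and compare the resulting vectors. Everything specific here is an assumption made for illustration: the function names, the default block length of 50, the use of the Kyte-Doolittle hydropathy scale as the only property, the averaging of block vectors, and the Euclidean distance.

```python
import math

# Kyte-Doolittle hydropathy scale. Assumption: used here as a stand-in
# physicochemical property; the abstract does not say which properties PCV uses.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(HYDROPATHY)


def split_blocks(seq, block_len):
    """Cut seq into consecutive blocks of block_len residues (the last may be shorter)."""
    return [seq[i:i + block_len] for i in range(0, len(seq), block_len)]


def encode_block(block):
    """Return 20 residue frequencies followed by the block's mean hydropathy."""
    n = len(block)
    freqs = [block.count(a) / n for a in AMINO_ACIDS]
    mean_hydropathy = sum(HYDROPATHY[a] for a in block) / n  # KeyError on non-standard residues
    return freqs + [mean_hydropathy]


def encode_sequence(seq, block_len=50):
    """Average the per-block vectors into one 21-dimensional sequence vector."""
    if not seq:
        raise ValueError("sequence must not be empty")
    # Each block is encoded independently, so this step could be run in parallel.
    vectors = [encode_block(b) for b in split_blocks(seq.upper(), block_len)]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def distance(seq_a, seq_b, block_len=50):
    """Euclidean distance between the two sequence vectors (alignment-free)."""
    return math.dist(encode_sequence(seq_a, block_len), encode_sequence(seq_b, block_len))
```

Because no block depends on any other, the list comprehension in `encode_sequence` is the natural place to distribute work across processes, which is the property the abstract points to when it mentions a parallel implementation. A pairwise distance matrix built from `distance` could then feed a clustering or tree-building step, though how PCV does this is not stated here.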
