Pal Jayanta, Ghosh Soumen, Maji Bansibadan, Bhattacharya Dilip Kumar
Department of ECE, National Institute of Technology, Durgapur 713209, India.
Department of CSE, Narula Institute of Technology, Kolkata 700109, India.
ACS Omega. 2022 Oct 17;7(43):39446-39455. doi: 10.1021/acsomega.2c06103. eCollection 2022 Nov 1.
The main difficulty in developing new protein sequence comparison techniques is devising a method that can handle large data sets of sequences of varying lengths quickly and effectively. In this work, we first obtain two separate numerical representations of protein sequences, one based on a physical property of amino acids and one based on a chemical property. The lengths of all the sequences under comparison are made equal by appending the required number of zeroes. Then, a fast Fourier transform is applied to each numerical series to obtain the corresponding spectrum. Next, the spectrum values are reduced by the standard inter coefficient difference method. Finally, the normalized values of the reduced spectrum are selected as the descriptors for protein sequence comparison. From these descriptors, distance matrices are computed using the Euclidean distance and are subsequently used to draw phylogenetic trees with the UPGMA algorithm. Phylogenetic trees are first constructed for 9 ND4, 9 ND5, and 9 ND6 proteins using the polarity value as the chemical property and the molecular weight as the physical property. The trees are compared, and polarity is seen to be a better choice than molecular weight for protein sequence comparison. Next, using the polarity property, phylogenetic trees are obtained for 12 baculovirus and 24 transferrin proteins. The results are compared with those obtained earlier on the same sequences by other methods. Three assessment criteria are considered for comparison of the results: quality based on rationalized perception, quantitative measures based on symmetric distance, and computational speed. In all cases, the results of the proposed method are found to be more satisfactory than those of the earlier methods.
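The pipeline described above (numerical encoding, zero padding, FFT, spectrum reduction, normalization, Euclidean distance, UPGMA) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not list the polarity values, so the Grantham (1974) polarity scale is assumed; the inter coefficient difference step is assumed to mean differences between consecutive spectrum magnitudes; and the normalization is assumed to be scaling to unit Euclidean norm. The function names `encode`, `descriptors`, and `upgma` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Assumption: Grantham (1974) polarity scale; the abstract does not give the values used.
POLARITY = {
    "A": 8.1, "R": 10.5, "N": 11.6, "D": 13.0, "C": 5.5,
    "Q": 10.5, "E": 12.3, "G": 9.0, "H": 10.4, "I": 5.2,
    "L": 4.9, "K": 11.3, "M": 5.7, "F": 5.2, "P": 8.0,
    "S": 9.2, "T": 8.6, "W": 5.4, "Y": 6.2, "V": 5.9,
}


def encode(seq, scale=POLARITY):
    """Map each amino acid letter to its property value."""
    return np.array([scale[aa] for aa in seq], dtype=float)


def descriptors(seqs, scale=POLARITY):
    """Return one descriptor row per sequence."""
    n = max(len(s) for s in seqs)
    rows = []
    for s in seqs:
        x = np.zeros(n)
        x[: len(s)] = encode(s, scale)   # append zeroes up to the common length
        mag = np.abs(np.fft.fft(x))      # magnitude spectrum
        icd = np.diff(mag)               # assumed form of the inter coefficient difference
        norm = np.linalg.norm(icd)
        rows.append(icd / norm if norm > 0 else icd)  # assumed unit-norm scaling
    return np.vstack(rows)


def upgma(seqs, scale=POLARITY):
    """Return condensed Euclidean distances and the UPGMA linkage matrix."""
    dist = pdist(descriptors(seqs, scale), metric="euclidean")
    return dist, linkage(dist, method="average")  # average linkage is UPGMA


if __name__ == "__main__":
    toy = ["MKVLA", "MKVLAG", "WWYFC"]  # toy sequences, not data from the paper
    dist, tree = upgma(toy)
    print(dist)
    print(tree)
```

The linkage matrix returned by `upgma` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree. Swapping `POLARITY` for a molecular weight table would give the physical property variant compared in the abstract.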