Bajusz Dávid, Miranda-Quintana Ramón Alain, Rácz Anita, Héberger Károly
Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary.
Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, FL 32611, USA.
Comput Struct Biotechnol J. 2021 Jun 16;19:3628-3639. doi: 10.1016/j.csbj.2021.06.021. eCollection 2021.
Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitous in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g., percent identity) or more intricate concepts (e.g., substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two possible items (e.g., DNA/RNA sequences with 4 possible items, or protein sequences with 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons.
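As a point of reference for the conventional approach described above, the following minimal Python sketch computes percent identity for pre-aligned, equal-length sequences and averages it over all pairs in a set. It is not the extended many-item index introduced in the paper, nor code from the linked repository; the function names and the toy sequences are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of positions with identical characters.

    Assumes the two sequences are already aligned to the same length.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def mean_pairwise_identity(seqs):
    """Average percent identity over all unordered pairs in the set.

    Returns the mean and the number of pairwise comparisons performed,
    which is n * (n - 1) / 2 for n sequences.
    """
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = sum(percent_identity(a, b) for a, b in pairs)
    return total / len(pairs), len(pairs)


# Hypothetical toy set of aligned DNA sequences (4 possible items per position).
seqs = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "ACGAACGT"]
mean_pid, n_pairs = mean_pairwise_identity(seqs)
print(f"{n_pairs} pairwise comparisons, mean identity {mean_pid:.2f}%")
```

Even for this set of four sequences, six comparisons are needed, and the count grows quadratically with the number of sequences. This is the cost that a direct quantification of set similarity avoids.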