Extended many-item similarity indices for sets of nucleotide and protein sequences.

Author information

Bajusz Dávid, Miranda-Quintana Ramón Alain, Rácz Anita, Héberger Károly

Affiliations

Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary.

Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, FL 32611, USA.

Publication information

Comput Struct Biotechnol J. 2021 Jun 16;19:3628-3639. doi: 10.1016/j.csbj.2021.06.021. eCollection 2021.

DOI:10.1016/j.csbj.2021.06.021

PMID:34257841

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8253954/

Abstract

Quantification of similarities between protein sequences or DNA/RNA strands is a (sub-)task that is ubiquitous in bioinformatics workflows, and is usually accomplished by pairwise comparisons of sequences, utilizing simple (e.g., percent identity) or more intricate concepts (e.g., substitution scoring matrices). Complex tasks (such as clustering) rely on a large number of pairwise comparisons under the hood, instead of a direct quantification of set similarities. Based on our recently introduced framework that enables multiple comparisons of binary molecular fingerprints (i.e., direct calculation of the similarity of fingerprint sets), here we introduce novel symmetric similarity indices for analogous calculations on sets of character sequences with more than two (t) possible items (e.g., DNA/RNA sequences with t = 4, or protein sequences with t = 20). The features of these new indices are studied in detail with analysis of variance (ANOVA), and demonstrated with three case studies of protein/DNA sequences with varying degrees of similarity (or evolutionary proximity). The Python code for the extended many-item similarity indices is publicly available at: https://github.com/ramirandaq/tn_Comparisons.
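
To make the contrast between pairwise and set-level quantification concrete, here is a minimal Python sketch. It is not the extended many-item index defined in the paper, and it does not use the code in the authors' repository. It only illustrates, for plain percent identity, that a set-level value can be obtained from per-column character counts without enumerating every pair. The function names (`percent_identity`, `mean_pairwise_identity`, `columnwise_identity`) and the toy sequences are hypothetical, and the sequences are assumed to be pre-aligned to equal length, with any gap symbol treated as an ordinary character.

```python
from collections import Counter
from itertools import combinations
from math import comb


def percent_identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_identity(seqs: list[str]) -> float:
    """Average percent identity over all n*(n-1)/2 pairs (quadratic in n)."""
    pairs = list(combinations(seqs, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


def columnwise_identity(seqs: list[str]) -> float:
    """Same average, computed from character counts in each column (linear in n)."""
    n, length = len(seqs), len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    matching_pairs = 0
    for column in zip(*seqs):
        # A character seen k times in a column accounts for C(k, 2) matching pairs.
        matching_pairs += sum(comb(k, 2) for k in Counter(column).values())
    return matching_pairs / (length * comb(n, 2))


# Toy aligned DNA sequences (t = 4 possible characters), invented for illustration.
dna = ["ACGTAC", "ACGTTC", "ACCTAC", "TCGTAC"]
print(mean_pairwise_identity(dna), columnwise_identity(dna))
```

Both functions return the same number for any aligned set of at least two sequences; the second visits each column once instead of each pair once, which is the practical appeal of comparing a whole set directly.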

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b0ed/8253954/3f0e8dc8014e/ga1.jpg
