Intercorrelation of Major DNA/RNA Sequence Descriptors - A Preliminary Study

Sen Dwaipayan, Dasgupta Subhadeep, Pal Indrajit, Manna Smarajit, Basak Subhash C, Nandy Ashesh, Grunwald Gregory D

Centre for Interdisciplinary Research and Education, Jodhpur Park, Kolkata 700068, India.

Curr Comput Aided Drug Des. 2016;12(3):216-228. doi: 10.2174/1573409912666160525111918.

UNLABELLED

A large number of alignment-free techniques for the graphical representation and numerical characterization (GRANCH) of bio-molecular sequences have been proposed in recent years, but the relative efficacy of these methods in determining the degree of similarity and dissimilarity between such sequences has not been ascertained.

OBJECTIVE

Our objective is to assess the relative efficacy of these methods in determining the degree of similarity and dissimilarity between bio-molecular sequences.

METHOD

We chose 7 published/communicated methods that represent various classes of GRANCH techniques and computed the descriptors that are expected to characterize similarities and dissimilarities in several sets of gene sequences. We critically appraise the different methods and determine which of them yield non-redundant structural information that could be used to compute different properties of the sequences, and which are so strongly correlated with one another that using the simplest representative of the group would suffice. We also perform a principal component analysis (PCA) to determine how much of the variance in the calculated sequence descriptors is explained by the computed principal components (PCs).
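The two analyses described in this section, pairwise descriptor correlation and PCA, can be illustrated with a minimal sketch. It is not the authors' code: the data are synthetic, the function names `correlation_matrix` and `pca_explained_variance` are hypothetical, and running PCA on the correlation matrix (that is, on standardized descriptors) is an assumption, since the abstract does not say how the descriptors were scaled.

```python
import numpy as np

def correlation_matrix(X):
    """Pearson correlations between descriptor columns.

    X has one row per sequence and one column per descriptor.
    """
    return np.corrcoef(X, rowvar=False)

def pca_explained_variance(X):
    """Fraction of total variance explained by each PC, largest first.

    Assumption: PCA is done on the correlation matrix, which is
    equivalent to standardizing every descriptor beforehand.
    """
    eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return eigvals / eigvals.sum()

# Synthetic stand-in for a descriptor table: 50 "sequences", 4 "descriptors".
# Columns 0 and 1 share one latent factor, columns 2 and 3 share another.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.01 * rng.normal(size=50),
    base[:, 1],
    -base[:, 1] + 0.01 * rng.normal(size=50),
])

R = correlation_matrix(X)
ratios = pca_explained_variance(X)
print(np.round(R, 2))
print(np.round(np.cumsum(ratios), 3))  # cumulative variance explained
```

In this toy table, descriptors built from the same latent factor are almost perfectly correlated, and the cumulative explained variance saturates after as many PCs as there are independent factors.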

RESULTS

We found that some of the descriptors are strongly correlated, implying a commonality of the structural information they encode, while others are distinctly separate. The PCA results show that the first three PCs explain >97% of the variance.

CONCLUSION

We found that some mathematical DNA descriptors calculated by a few of these techniques correlate strongly with one another, implying redundancy in the structural information quantified by those descriptors; others are not strongly correlated with one another, suggesting that they encode non-redundant sequence information. Based on this and our PCA results, we recommend using a minimally correlated set of descriptors, or orthogonal descriptors such as PCs derived from the descriptor set, for the characterization of nucleic acid structure and function.
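One way to obtain a minimally correlated descriptor set is a greedy filter on the absolute correlation matrix, sketched below. This is an illustration only: the abstract does not describe a selection procedure, and the function `select_low_correlation`, the descriptor names and the 0.9 cutoff are all hypothetical choices.

```python
import numpy as np

def select_low_correlation(X, names, threshold=0.9):
    """Greedily keep descriptors weakly correlated with all kept so far.

    A descriptor is kept only if its |Pearson r| with every previously
    kept descriptor is below `threshold` (0.9 is an assumed cutoff).
    Columns are visited in order, so earlier (e.g. simpler) descriptors
    are preferred as group representatives.
    """
    R = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(R[j, k] < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic example: d2 is nearly a copy of d1, d3 is independent of both.
rng = np.random.default_rng(1)
a = rng.normal(size=40)
b = rng.normal(size=40)
X = np.column_stack([a, a + 0.01 * rng.normal(size=40), b])
names = ["d1", "d2", "d3"]

selected = select_low_correlation(X, names)
print(selected)
```

Ordering the columns from simplest to most complex descriptor makes the filter keep the simplest representative of each correlated group. Using PCs instead avoids the cutoff altogether, at the cost of less directly interpretable variables.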
