Wu Xiaoyong, Rai Shesh N, Weber Georg F
Biostatistics and Informatics Shared Resources, University of Cincinnati Cancer Center, College of Medicine, Cincinnati, OH, USA.
Cancer Data Science Center, University of Cincinnati College of Medicine, Department of Biostatistics, Health Informatics and Data Sciences, Cincinnati, OH, USA.
Comput Struct Biotechnol J. 2024 Jun 21;23:2637-2647. doi: 10.1016/j.csbj.2024.06.017. eCollection 2024 Dec.
Molecular phylogenetic research has relied on the analysis of the coding sequences of genes or of the amino acid sequences of the encoded proteins. Counting mismatches, taken as indicators of mutation, has been central to the pertinent algorithms. Specific amino acids possess quantifiable characteristics that enable the conversion from "words" (strings of letters denoting amino acids or bases) to "waves" (strings of quantitative values representing the physico-chemical properties) or to matrices (coordinates representing the positions in a comprehensive property space). Applying such numerical representations to evolutionary analysis takes into account not only the occurrence of mutations but also their properties as influences that drive speciation: selective pressures favor certain mutations over others, and this preference is reflected in the characteristics of the incorporated amino acids (it is not borne out by the mismatches alone). Besides being more discriminating inputs for tree-generating algorithms than match/mismatch, the number strings can be examined for overall similarity with average mutual information, autocorrelation, and fractal dimension. Bivariate wavelet analysis aids in distinguishing hypermutable from conserved domains of the protein. The matrix depiction is readily subjected to comparisons of distances, and it allows the generation of heat maps or graphs. For the protein S100A6, this analysis preserves the accepted order of taxa where tree construction with standard approaches yields conflicting results. It also aids hypothesis generation about the origin of mitochondrial proteins. These analytical algorithms have been automated in R and are applicable to various processes that are describable in matrix format.
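The "words" to "waves" conversion can be illustrated with a minimal sketch. It is written in Python for illustration and is not the authors' R implementation. The abstract does not name the physico-chemical properties used, so the Kyte-Doolittle hydropathy index is assumed here as one possible scale, and the function names (`to_wave`, `mismatches`, `wave_distance`, `autocorrelation`) are hypothetical. The sketch assumes pre-aligned, gap-free sequences of equal length.

```python
import math

# Assumption: Kyte-Doolittle hydropathy index as the property scale.
# The abstract does not specify which properties the authors use.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def to_wave(seq):
    """Convert a 'word' (amino acid letters) to a 'wave' (property values)."""
    return [KD[aa] for aa in seq.upper()]


def mismatches(a, b):
    """Classical match/mismatch count between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def wave_distance(a, b):
    """Euclidean distance between the property waves of two aligned sequences."""
    wa, wb = to_wave(a), to_wave(b)
    if len(wa) != len(wb):
        raise ValueError("sequences must be aligned to equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(wa, wb)))


def autocorrelation(wave, lag):
    """Sample autocorrelation of a wave at a given non-negative lag."""
    n = len(wave)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(wave)")
    mean = sum(wave) / n
    var = sum((v - mean) ** 2 for v in wave)
    if var == 0:
        raise ValueError("autocorrelation is undefined for a constant wave")
    cov = sum((wave[i] - mean) * (wave[i + lag] - mean) for i in range(n - lag))
    return cov / var


if __name__ == "__main__":
    reference = "ACDL"
    conservative = "ACDI"  # L -> I: both residues are hydrophobic
    radical = "ACDK"       # L -> K: hydrophobic replaced by charged
    for variant in (conservative, radical):
        print(variant, mismatches(reference, variant),
              round(wave_distance(reference, variant), 3))
```

Both toy variants differ from the reference at a single position, so a mismatch count cannot tell them apart. On the assumed hydropathy scale, the conservative substitution lies at a distance of about 0.7 from the reference and the radical one at about 7.7, which illustrates why a numerical representation can be a more discriminating input than match/mismatch.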