Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA.
Department of Biology, Temple University, Philadelphia, PA, USA.
Mol Biol Evol. 2021 Oct 27;38(11):4674-4682. doi: 10.1093/molbev/msab227.
We introduce a supervised machine learning approach with sparsity constraints for phylogenomics, referred to as evolutionary sparse learning (ESL). ESL builds models with genomic loci (such as genes, proteins, genomic segments, and positions) as parameters. Using the Least Absolute Shrinkage and Selection Operator, ESL selects only the most important genomic loci to explain a given phylogenetic hypothesis or presence/absence of a trait. ESL models do not directly involve conventional parameters such as rates of substitution between nucleotides, rate variation among positions, and phylogeny branch lengths. Instead, ESL directly employs the concordance of variation across sequences in an alignment with the evolutionary hypothesis of interest. ESL provides a natural way to combine different molecular and nonmolecular data types and incorporate biological and functional annotations of genomic loci in model building. We propose positional, gene, function, and hypothesis sparsity scores, illustrate their use through an example, and suggest several applications of ESL. The ESL framework has the potential to drive the development of a new class of computational methods that will complement traditional approaches in evolutionary genomics, particularly for identifying influential loci and sequences given a phylogeny and building models to test hypotheses. ESL's fast computational times and small memory footprint will also help democratize big data analytics and improve scientific rigor in phylogenomics.
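The selection step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: alignment columns are one-hot encoded, the hypothesis is coded as +1/-1 for membership in a clade of interest, and a plain least-squares LASSO is fitted by cyclic coordinate descent with no intercept. The toy alignment, the penalty value, and all function names are hypothetical, and this is not the authors' implementation.

```python
from typing import Dict, List, Tuple

Feature = Tuple[int, str]  # (0-based alignment column, residue)


def one_hot_encode(alignment: Dict[str, str]) -> Tuple[List[str], List[Feature], List[List[float]]]:
    """Turn every (column, residue) pair in a variable column into a 0/1 feature."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    features: List[Feature] = []
    for pos in range(length):
        residues = sorted({alignment[name][pos] for name in names})
        if len(residues) > 1:  # invariant columns cannot discriminate
            features.extend((pos, res) for res in residues)
    matrix = [[1.0 if alignment[name][pos] == res else 0.0 for pos, res in features]
              for name in names]
    return names, features, matrix


def soft_threshold(value: float, penalty: float) -> float:
    """Shrink value toward zero by penalty; values within the penalty become 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X: List[List[float]], y: List[float],
                             penalty: float, sweeps: int = 100) -> List[float]:
    """Minimise (1/2n)*||y - X b||^2 + penalty*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, kept up to date incrementally
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(col[i] * (residual[i] + col[i] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, penalty) / norm
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_beta
    return beta


# Hypothetical toy alignment: sp1-sp3 form the clade being tested.
alignment = {
    "sp1": "ACGTA",
    "sp2": "ACGTC",
    "sp3": "ACGAA",
    "sp4": "ATGTC",
    "sp5": "ATGAA",
    "sp6": "ATGTA",
}
hypothesis = {"sp1": 1.0, "sp2": 1.0, "sp3": 1.0, "sp4": -1.0, "sp5": -1.0, "sp6": -1.0}

names, features, X = one_hot_encode(alignment)
y = [hypothesis[name] for name in names]
beta = lasso_coordinate_descent(X, y, penalty=0.1)

# Columns that keep at least one nonzero coefficient after shrinkage.
selected = sorted({pos for (pos, _), b in zip(features, beta) if b != 0.0})
print("selected columns:", selected)
```

On this toy input only column index 1 (C in the clade, T outside it) keeps nonzero coefficients; the two other variable columns are uncorrelated with the hypothesis and are shrunk to exactly zero. Real analyses would involve far more sequences and positions and a tuned penalty.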