Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA.
BMC Bioinformatics. 2021 Apr 1;22(1):174. doi: 10.1186/s12859-021-04096-6.
Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. The result can be inaccurate models, or models that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive performance. To address these issues, we present HARVESTMAN, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.
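To illustrate what "hierarchical relationships among representations" can mean in practice, the following is a minimal sketch that rolls binary SNP calls up to coarser levels with a logical OR. The levels (SNP, gene, pathway), the `roll_up` helper, and all identifiers in the toy data are hypothetical and chosen only for illustration; they are not taken from HARVESTMAN itself.

```python
def roll_up(calls, parent_of):
    """Aggregate child-level binary features to their parents with a logical OR.

    calls     -- dict mapping a feature name to 0 or 1 for one sample
    parent_of -- dict mapping a feature name to its parent in the hierarchy
    """
    rolled = {}
    for child, value in calls.items():
        parent = parent_of.get(child)
        if parent is None:
            continue  # child has no parent in this (hypothetical) hierarchy
        # A parent is "on" if any of its children is "on".
        rolled[parent] = max(rolled.get(parent, 0), value)
    return rolled


# Toy data for one sample (all names are made up).
snp_calls = {"snp1": 1, "snp2": 0, "snp3": 0}
snp_to_gene = {"snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B"}
gene_to_pathway = {"GENE_A": "PATHWAY_X", "GENE_B": "PATHWAY_X"}

gene_calls = roll_up(snp_calls, snp_to_gene)
pathway_calls = roll_up(gene_calls, gene_to_pathway)
```

Each level yields a different encoding of the same sample, so a learner can be offered SNP-level, gene-level, and pathway-level features and pick whichever suits the task.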
We demonstrate that HARVESTMAN scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that HARVESTMAN selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare HARVESTMAN to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining the accuracy of the resulting classifier.
HARVESTMAN is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, HARVESTMAN automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, HARVESTMAN is faster and selects features more parsimoniously.
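The idea of selecting a non-redundant set of features from a hierarchy can be sketched on a toy example. The code below is not HARVESTMAN's formulation: it assumes a tree rather than a general knowledge graph, assumes each node carries a precomputed relevance score, and assumes the constraint that no selected node may be an ancestor of another selected node. Under those assumptions the 0/1 selection problem (maximize the summed score minus a per-feature penalty) can be solved exactly by dynamic programming, which stands in here for an integer linear program solver. The function name, scores, and penalty are all hypothetical.

```python
def select_features(tree, score, root, penalty=0.1):
    """Pick nodes maximizing sum(score - penalty) with no ancestor/descendant pair.

    tree  -- dict mapping a node to the list of its children
    score -- dict mapping a node to an assumed relevance score
    Returns (objective value, list of selected nodes).
    """
    def solve(node):
        # Best solution that uses only nodes strictly below `node`.
        child_value, child_choice = 0.0, []
        for child in tree.get(node, []):
            value, choice = solve(child)
            child_value += value
            child_choice += choice
        # Alternative: select `node` itself, which excludes its descendants.
        own_value = score[node] - penalty
        if own_value > child_value:
            return own_value, [node]
        return child_value, child_choice

    return solve(root)


# Toy hierarchy and scores (all names and numbers are made up).
tree = {
    "PATHWAY_X": ["GENE_A", "GENE_B"],
    "GENE_A": ["snp1", "snp2"],
    "GENE_B": ["snp3"],
}
score = {
    "PATHWAY_X": 0.5, "GENE_A": 0.6, "GENE_B": 0.05,
    "snp1": 0.3, "snp2": 0.2, "snp3": 0.3,
}

value, selected = select_features(tree, score, "PATHWAY_X", penalty=0.1)
```

In this toy instance the gene-level feature is chosen for one branch and the SNP-level feature for the other, which mirrors the claim that different representations can be selected in different parts of the hierarchy. The per-feature penalty is what pushes the solution toward smaller subsets.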