HARVESTMAN: a framework for hierarchical feature learning and selection from whole genome sequencing data.

Affiliations

Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.

Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA.

Publication information

BMC Bioinformatics. 2021 Apr 1;22(1):174. doi: 10.1186/s12859-021-04096-6.

DOI:10.1186/s12859-021-04096-6

PMID:33794760

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8017869/

Abstract

BACKGROUND

Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present HARVESTMAN, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.
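As a rough illustration of what "hierarchical relationships among representations" can mean, the sketch below rolls binary SNP calls up to coarser indicators. The two upper levels (gene, pathway), the node names, the toy samples, and the helpers `roll_up` and `encode` are all hypothetical; this is not HARVESTMAN's actual knowledge graph or code.

```python
# Hypothetical toy data: the SNPs each sample carries (binary variant calls).
sample_snps = {
    "sample_a": {"snp1", "snp3"},
    "sample_b": {"snp2"},
    "sample_c": set(),
}

# Hypothetical hierarchy: each SNP has a parent gene, each gene a parent pathway.
snp_to_gene = {"snp1": "geneX", "snp2": "geneX", "snp3": "geneY"}
gene_to_pathway = {"geneX": "pathwayP", "geneY": "pathwayP"}


def roll_up(items, parent_of):
    """Return the set of parent nodes hit by at least one item."""
    return {parent_of[i] for i in items if i in parent_of}


def encode(snps):
    """Build one binary indicator per node at every level of the hierarchy."""
    genes = roll_up(snps, snp_to_gene)
    pathways = roll_up(genes, gene_to_pathway)
    all_nodes = set(snp_to_gene) | set(gene_to_pathway) | set(gene_to_pathway.values())
    present = snps | genes | pathways
    return {node: int(node in present) for node in sorted(all_nodes)}


features = {name: encode(snps) for name, snps in sample_snps.items()}
```

In this toy example, `sample_a` and `sample_b` share no SNP, yet both are marked positive for `geneX`. A coarser representation can therefore carry signal that a SNP-level encoding spreads across several sparse columns.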

RESULTS

We demonstrate that HARVESTMAN scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that HARVESTMAN selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare HARVESTMAN to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining the accuracy of the resulting classifier.

CONCLUSION

HARVESTMAN is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, HARVESTMAN automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, HARVESTMAN is faster and selects features more parsimoniously.
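The abstract does not state the integer linear program itself, so the following is only a minimal sketch of how hierarchical feature selection can be posed as a 0/1 program. It assumes a made-up objective (relevance minus a per-feature penalty) and a made-up non-redundancy constraint (at most one selected node on each leaf-to-root path). The tiny instance is solved by exhaustive enumeration rather than an ILP solver. The hierarchy, the scores, and the `penalty` value are hypothetical.

```python
from itertools import product

# Hypothetical hierarchy (child -> parent) and made-up relevance scores.
parent = {"snp1": "geneX", "snp2": "geneX", "snp3": "geneY",
          "geneX": "pathwayP", "geneY": "pathwayP"}
relevance = {"snp1": 0.2, "snp2": 0.1, "snp3": 0.6,
             "geneX": 0.5, "geneY": 0.4, "pathwayP": 0.7}
nodes = sorted(relevance)
leaves = [n for n in nodes if n not in parent.values()]
penalty = 0.15  # assumed cost per selected feature, rewards parsimony


def path_to_root(node):
    """List the node followed by all of its ancestors."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def feasible(x):
    """Assumed constraint: at most one selected node per leaf-to-root path."""
    return all(sum(x[n] for n in path_to_root(leaf)) <= 1 for leaf in leaves)


def objective(x):
    """Assumed linear objective: total relevance minus the size penalty."""
    return sum((relevance[n] - penalty) * x[n] for n in nodes)


# Enumerate every 0/1 assignment; only viable because the toy graph has 6 nodes.
best = max(
    (dict(zip(nodes, bits)) for bits in product((0, 1), repeat=len(nodes))),
    key=lambda x: objective(x) if feasible(x) else float("-inf"),
)
selected = sorted(n for n in nodes if best[n])
```

With these scores the optimum mixes levels: it keeps `geneX` in place of its two weak SNPs but keeps `snp3` in place of its parent gene. Enumeration grows as 2^n, so a real instance over millions of variants would need a dedicated ILP solver or a formulation with exploitable structure.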
