HARVESTMAN: a framework for hierarchical feature learning and selection from whole genome sequencing data.

Affiliations

Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.

Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA.

Publication information

BMC Bioinformatics. 2021 Apr 1;22(1):174. doi: 10.1186/s12859-021-04096-6.

DOI: 10.1186/s12859-021-04096-6
PMID: 33794760
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8017869/

Abstract

BACKGROUND

Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability. This can bring about inaccurate models or those that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present HARVESTMAN, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building.
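
To make the idea of hierarchical representations concrete, the sketch below rolls binary SNP calls up into coarser features. It is only an illustration under stated assumptions: the SNP, gene, and pathway names, the toy call matrix, and the "any variant present" aggregation rule are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical toy data: rows are samples, columns are binary SNP calls
# (1 = variant present). All identifiers are illustrative only.
snp_ids = ["snp1", "snp2", "snp3", "snp4"]
calls = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

# Hypothetical hierarchy: each parent feature lists the children it summarises.
hierarchy: Dict[str, List[str]] = {
    "geneA": ["snp1", "snp2"],
    "geneB": ["snp3", "snp4"],
    "pathwayX": ["geneA", "geneB"],
}


def encode(feature: str, sample: List[int]) -> int:
    """Return 1 if the sample carries any variant beneath `feature`."""
    if feature in hierarchy:
        # Internal node: logical OR over its children (an assumed rule).
        return int(any(encode(child, sample) for child in hierarchy[feature]))
    # Leaf node: look up the raw SNP call.
    return sample[snp_ids.index(feature)]


def column(feature: str) -> List[int]:
    """Encode every sample at the chosen level of the hierarchy."""
    return [encode(feature, sample) for sample in calls]
```

In this toy, `column("geneA")` merges two sparse SNP columns into one denser feature, which is the kind of alternative encoding a learner might prefer over raw variant calls.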

RESULTS

We demonstrate that HARVESTMAN scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that HARVESTMAN selects a rich combination of representations that are adapted to the learning task, and performs better than a binary representation of SNPs alone. We compare HARVESTMAN to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining accuracy of the resulting classifier.

CONCLUSION

HARVESTMAN is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, HARVESTMAN automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, HARVESTMAN is faster and selects features more parsimoniously.
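
The abstract does not state the integer linear program itself, so the following is only a toy 0-1 selection problem in the same spirit, not HARVESTMAN's formulation. It assumes a hypothetical feature tree, made-up integer relevance scores, a fixed per-feature penalty to encourage parsimony, and a non-redundancy constraint that at most one feature may be chosen on any root-to-leaf path. For a problem this small, exhaustive enumeration stands in for an ILP solver.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical tree over candidate features (parent -> children).
children: Dict[str, List[str]] = {
    "pathwayX": ["geneA", "geneB"],
    "geneA": ["snp1", "snp2"],
    "geneB": ["snp3", "snp4"],
}
# Hypothetical relevance scores (integers, to avoid rounding issues).
score: Dict[str, int] = {
    "pathwayX": 50, "geneA": 90, "geneB": 20,
    "snp1": 30, "snp2": 30, "snp3": 10, "snp4": 60,
}
PENALTY = 25  # assumed cost paid for each selected feature


def root_to_leaf_paths(node: str, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """List every path from `node` down to a leaf."""
    path = prefix + (node,)
    if node not in children:
        return [path]
    paths: List[Tuple[str, ...]] = []
    for child in children[node]:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


def select(root: str) -> Tuple[List[str], int]:
    """Maximise sum((score - PENALTY) * x) over binary x, subject to
    at most one selected feature on each root-to-leaf path."""
    nodes = sorted(score)
    paths = root_to_leaf_paths(root)
    best_value, best_set = 0, []
    for bits in product((0, 1), repeat=len(nodes)):
        x = dict(zip(nodes, bits))
        # Skip assignments that pick both an ancestor and a descendant.
        if any(sum(x[n] for n in path) > 1 for path in paths):
            continue
        value = sum((score[n] - PENALTY) * x[n] for n in nodes)
        if value > best_value:
            best_value, best_set = value, [n for n in nodes if x[n]]
    return best_set, best_value


selected, objective = select("pathwayX")
```

With these made-up scores, the optimum mixes levels of the hierarchy: it keeps the gene-level feature `geneA` and the SNP-level feature `snp4`, illustrating how a single selection can combine representations of different granularity.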
