Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses.

Affiliations

Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA.

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, 19104, USA.

Publication information

BMC Bioinformatics. 2020 Oct 1;21(1):430. doi: 10.1186/s12859-020-03755-4.

DOI:10.1186/s12859-020-03755-4
PMID:32998684
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7528347/
Abstract

BACKGROUND

A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.

RESULTS

We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj .
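The idea of regressing out covariates without leakage can be illustrated with a minimal sketch. This is not the TPOT extension itself: it assumes a plain linear least-squares adjustment, and the helper names (`fit_covariate_model`, `residualize`, `leakage_free_folds`) are hypothetical. The point it shows is that the covariate regression is fitted on the training rows of each fold only, and the same fitted coefficients are then applied to the held-out rows.

```python
import numpy as np


def fit_covariate_model(C_train, X_train):
    """Least-squares fit of X ~ [1, C], using training rows only."""
    design = np.column_stack([np.ones(len(C_train)), C_train])
    coef, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    return coef


def residualize(C, X, coef):
    """Subtract the covariate-predicted part of X using already-fitted coefficients."""
    design = np.column_stack([np.ones(len(C)), C])
    return X - design @ coef


def leakage_free_folds(X, C, n_splits=5, seed=0):
    """Yield (train_idx, test_idx, X_train_adj, X_test_adj) for each fold.

    The covariate model never sees the held-out rows, so no information
    from the test fold leaks into the adjustment.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    for test_idx in np.array_split(order, n_splits):
        train_idx = np.setdiff1d(order, test_idx)
        coef = fit_covariate_model(C[train_idx], X[train_idx])
        X_train_adj = residualize(C[train_idx], X[train_idx], coef)
        X_test_adj = residualize(C[test_idx], X[test_idx], coef)
        yield train_idx, test_idx, X_train_adj, X_test_adj
```

Here `X` is a samples-by-features array and `C` a samples-by-covariates array. Fitting the adjustment once on the full data set before splitting would let the held-out samples influence the coefficients, which is the kind of leakage the abstract refers to. The same pattern could be applied to the target instead of, or in addition to, the features.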

CONCLUSIONS

In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b62e/7528347/569c660108bf/12859_2020_3755_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b62e/7528347/f8725bb1c10c/12859_2020_3755_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b62e/7528347/e5ef745f858e/12859_2020_3755_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b62e/7528347/cce11f09d4a5/12859_2020_3755_Fig4_HTML.jpg

Similar articles

1
Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses.
BMC Bioinformatics. 2020 Oct 1;21(1):430. doi: 10.1186/s12859-020-03755-4.
2
Scaling tree-based automated machine learning to biomedical big data with a feature set selector.
Bioinformatics. 2020 Jan 1;36(1):250-256. doi: 10.1093/bioinformatics/btz470.
3
Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning.
Bioinformatics. 2020 Mar 1;36(6):1772-1778. doi: 10.1093/bioinformatics/btz796.
4
Comparisons of automated machine learning (AutoML) in predicting whistleblowing of academic dishonesty with demographic and theory of planned behavior.
MethodsX. 2023 Sep 7;11:102364. doi: 10.1016/j.mex.2023.102364. eCollection 2023 Dec.
5
Automated machine learning based on radiomics features predicts H3 K27M mutation in midline gliomas of the brain.
Neuro Oncol. 2020 Mar 5;22(3):393-401. doi: 10.1093/neuonc/noz184.
6
An automated machine learning approach to predict brain age from cortical anatomical measures.
Hum Brain Mapp. 2020 Sep;41(13):3555-3566. doi: 10.1002/hbm.25028. Epub 2020 May 16.
7
Considerations for automated machine learning in clinical metabolic profiling: Altered homocysteine plasma concentration associated with metformin exposure.
Pac Symp Biocomput. 2018;23:460-471.
8
Genetic Analysis of Coronary Artery Disease Using Tree-Based Automated Machine Learning Informed By Biology-Based Feature Selection.
IEEE/ACM Trans Comput Biol Bioinform. 2022 May-Jun;19(3):1379-1386. doi: 10.1109/TCBB.2021.3099068. Epub 2022 Jun 3.
9
Automated machine learning to predict the co-occurrence of isocitrate dehydrogenase mutations and O6-methylguanine-DNA methyltransferase promoter methylation in patients with gliomas.
J Magn Reson Imaging. 2021 Jul;54(1):197-205. doi: 10.1002/jmri.27498. Epub 2021 Jan 3.
10
Inference of social cognition in schizophrenia patients with neurocognitive domains and neurocognitive tests using automated machine learning.
Asian J Psychiatr. 2024 Jan;91:103866. doi: 10.1016/j.ajp.2023.103866. Epub 2023 Dec 12.

Cited by

1
The tree-based pipeline optimization tool: Tackling biomedical research problems with genetic programming and automated machine learning.
Patterns (N Y). 2025 Jul 11;6(7):101314. doi: 10.1016/j.patter.2025.101314.
2
Blood-based DNA methylation and exposure risk scores predict PTSD with high accuracy in military and civilian cohorts.
BMC Med Genomics. 2024 Sep 27;17(1):235. doi: 10.1186/s12920-024-02002-6.
3
A generalizable normative deep autoencoder for brain morphological anomaly detection: application to the multi-site StratiBip dataset on bipolar disorder in an external validation framework.
bioRxiv. 2024 Sep 7:2024.09.04.611239. doi: 10.1101/2024.09.04.611239.
4
Blood-based DNA methylation and exposure risk scores predict PTSD with high accuracy in military and civilian cohorts.
Res Sq. 2024 Feb 15:rs.3.rs-3952163. doi: 10.21203/rs.3.rs-3952163/v1.
5
MiTree: A Unified Web Cloud Analytic Platform for User-Friendly and Interpretable Microbiome Data Mining Using Tree-Based Methods.
Microorganisms. 2023 Nov 20;11(11):2816. doi: 10.3390/microorganisms11112816.
6
A Data-Driven Analysis of Ward Capacity Strain Metrics That Predict Clinical Outcomes Among Survivors of Acute Respiratory Failure.
J Med Syst. 2023 Aug 5;47(1):83. doi: 10.1007/s10916-023-01978-5.
7
Automated quantitative trait locus analysis (AutoQTL).
BioData Min. 2023 Apr 10;16(1):14. doi: 10.1186/s13040-023-00331-3.
8
Reducing the complexity of high-dimensional environmental data: An analytical framework using LASSO with considerations of confounding for statistical inference.
Int J Hyg Environ Health. 2023 Apr;249:114116. doi: 10.1016/j.ijheh.2023.114116. Epub 2023 Feb 16.
9
What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics.
Hum Genet. 2022 Sep;141(9):1515-1528. doi: 10.1007/s00439-021-02402-z. Epub 2021 Dec 4.
10
Leveraging Automated Machine Learning for the Analysis of Global Public Health Data: A Case Study in Malaria.
Int J Public Health. 2021 Apr 13;66:614296. doi: 10.3389/ijph.2021.614296. eCollection 2021.

References

1
Model selection for metabolomics: predicting diagnosis of coronary artery disease using automated machine learning.
Bioinformatics. 2020 Mar 1;36(6):1772-1778. doi: 10.1093/bioinformatics/btz796.
2
Allele Loss and Reduced Expression of CYCLOPS Genes is a Characteristic Feature of Chromophobe Renal Cell Carcinoma.
Transl Oncol. 2019 Sep;12(9):1131-1137. doi: 10.1016/j.tranon.2019.05.005. Epub 2019 Jun 11.
3
Scaling tree-based automated machine learning to biomedical big data with a feature set selector.
Bioinformatics. 2020 Jan 1;36(1):250-256. doi: 10.1093/bioinformatics/btz470.
4
Comprehensive functional genomic resource and integrative model for the human brain.
Science. 2018 Dec 14;362(6420). doi: 10.1126/science.aat8464.
5
Tumour heterogeneity in triplet-paired metastatic tumour tissues in metastatic renal cell carcinoma: concordance analysis of target gene sequencing data.
J Clin Pathol. 2019 Feb;72(2):152-156. doi: 10.1136/jclinpath-2018-205456. Epub 2018 Nov 8.
6
ArrayExpress update - from bulk to single-cell expression data.
Nucleic Acids Res. 2019 Jan 8;47(D1):D711-D715. doi: 10.1093/nar/gky964.
7
The UK Biobank resource with deep phenotyping and genomic data.
Nature. 2018 Oct;562(7726):203-209. doi: 10.1038/s41586-018-0579-z. Epub 2018 Oct 10.
8
Calcium Channels, Synaptic Plasticity, and Neuropsychiatric Disease.
Neuron. 2018 May 2;98(3):466-481. doi: 10.1016/j.neuron.2018.03.017.
9
Analyzing the genes related to nicotine addiction or schizophrenia via a pathway and network based approach.
Sci Rep. 2018 Feb 13;8(1):2894. doi: 10.1038/s41598-018-21297-x.
10
Downregulation of guanine nucleotide-binding protein beta 1 (GNB1) is associated with worsened prognosis of clearcell renal cell carcinoma and is related to VEGF signaling pathway.
J BUON. 2017 Nov-Dec;22(6):1441-1446.