BioAutoML: automated feature engineering and metalearning to predict noncoding RNAs in bacteria.

Affiliations

Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil.

Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ GmbH, Leipzig, Saxony, Germany.

Publication information

Brief Bioinform. 2022 Jul 18;23(4). doi: 10.1093/bib/bbac218.

DOI:10.1093/bib/bbac218
PMID:35753697
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9294424/
Abstract

Recent technological advances have led to an exponential expansion of biological sequence data and to the extraction of meaningful information from these data through Machine Learning (ML) algorithms. This knowledge has improved the understanding of mechanisms related to several fatal diseases, e.g. cancer and coronavirus disease 2019, helping to develop innovative solutions such as CRISPR-based gene editing, coronavirus vaccines and precision medicine. These advances benefit our society and economy, directly impacting people's lives in various areas, such as health care, drug discovery, forensic analysis and food processing. Nevertheless, ML-based approaches to biological data require representative, quantitative and informative features. Many ML algorithms can handle only numerical data, and therefore sequences need to be translated into a numerical feature vector. This process, known as feature extraction, is a fundamental step in developing high-quality ML-based models in bioinformatics, because it enables the feature engineering stage, in which suitable features are designed and selected. Feature engineering, ML algorithm selection and hyperparameter tuning are often manual and time-consuming processes that require extensive domain knowledge. To deal with this problem, we present a new package: BioAutoML. BioAutoML automatically runs an end-to-end ML pipeline: it extracts numerical and informative features from biological sequence databases using the MathFeature package, and it automates feature selection, ML algorithm recommendation and hyperparameter tuning of the selected algorithm(s) using Automated ML (AutoML). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) metalearning (algorithm recommendation and hyperparameter tuning modules).
We experimentally evaluate BioAutoML in two different scenarios: (i) prediction of the three main classes of noncoding RNAs (ncRNAs) and (ii) prediction of the eight categories of ncRNAs in bacteria, including housekeeping and regulatory types. To assess BioAutoML's predictive performance, we compare it experimentally with two other AutoML tools (RECIPE and TPOT). According to the experimental results, BioAutoML can accelerate new studies, reducing the cost of feature engineering while either maintaining or improving predictive performance. BioAutoML is freely available at https://github.com/Bonidia/BioAutoML.
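
The feature extraction step mentioned in the abstract, turning a sequence into a numerical feature vector, can be illustrated with a minimal sketch. It assumes relative k-mer frequency as the descriptor, which is one common choice and not necessarily the one BioAutoML selects. The function name `kmer_frequencies` and its parameters are hypothetical and are not part of the MathFeature or BioAutoML APIs.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGU"):
    """Return relative k-mer frequencies of a sequence in a fixed order.

    Hypothetical helper for illustration only; not MathFeature's API.
    """
    # Treat DNA input as RNA so both map onto the same alphabet.
    seq = sequence.upper().replace("T", "U")
    # Enumerate all possible k-mers so every sequence yields a vector
    # of the same length (len(alphabet) ** k) in the same order.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


features = kmer_frequencies("AUGGCUAUGC", k=2)
print(len(features))
```

Because the vector has a fixed length and order regardless of sequence length, vectors from different sequences can be stacked into a matrix and passed to any ML algorithm that expects numerical input.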

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8165/9294424/83c4439321ce/bbac218f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8165/9294424/fb0b29d75422/bbac218f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8165/9294424/2f3ec304b1b5/bbac218f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8165/9294424/7687383996c8/bbac218f3.jpg
