Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil.
Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ GmbH, Leipzig, Saxony, Germany.
Brief Bioinform. 2022 Jul 18;23(4). doi: 10.1093/bib/bbac218.
Recent technological advances have led to an exponential expansion of biological sequence data and to the extraction of meaningful information through machine learning (ML) algorithms. This knowledge has improved the understanding of mechanisms related to several fatal diseases, e.g. cancer and coronavirus disease 2019, helping to develop innovative solutions such as CRISPR-based gene editing, coronavirus vaccines and precision medicine. These advances benefit our society and economy, directly impacting people's lives in various areas, such as health care, drug discovery, forensic analysis and food processing. Nevertheless, ML-based approaches to biological data require representative, quantitative and informative features. Many ML algorithms can handle only numerical data, and therefore sequences need to be translated into numerical feature vectors. This process, known as feature extraction, is a fundamental step in developing high-quality ML-based models in bioinformatics, because it enables the feature engineering stage, in which suitable features are designed and selected. Feature engineering, ML algorithm selection and hyperparameter tuning are often manual and time-consuming processes that require extensive domain knowledge. To deal with this problem, we present a new package: BioAutoML. BioAutoML automatically runs an end-to-end ML pipeline. It extracts numerical and informative features from biological sequence databases using the MathFeature package, and it uses automated ML (AutoML) to automate feature selection, the recommendation of ML algorithm(s) and the tuning of the hyperparameters of the selected algorithm(s). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) metalearning (algorithm recommendation and hyperparameter tuning modules).
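To make the feature extraction step concrete, the following is a minimal sketch of one common way to translate a nucleotide sequence into a numerical feature vector: relative k-mer frequencies. It is an illustration only and is not taken from BioAutoML or MathFeature; the function name `kmer_frequencies`, its parameters and the example sequence are hypothetical.

```python
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Hypothetical helper (not the MathFeature API).

    Returns the relative frequency of every possible k-mer over `alphabet`,
    ordered as generated by itertools.product, so each sequence maps to a
    fixed-length numerical vector of size len(alphabet) ** k.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k over the sequence and count known k-mers.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows containing other symbols, e.g. N
            counts[window] += 1
            total += 1
    if total == 0:
        # No valid k-mer found: return an all-zero vector of the same length.
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Example: a toy sequence becomes a 16-dimensional vector for k = 2.
features = kmer_frequencies("ACGTACGT", k=2)
```

Vectors of this kind have the same length for every input sequence, which is what allows them to be passed to ML algorithms that accept only numerical data.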
We experimentally evaluate BioAutoML in two different scenarios: (i) prediction of the three main classes of noncoding RNAs (ncRNAs) and (ii) prediction of the eight categories of ncRNAs in bacteria, including housekeeping and regulatory types. To assess its predictive performance, we compare BioAutoML experimentally with two other AutoML tools (RECIPE and TPOT). According to the experimental results, BioAutoML can accelerate new studies, reducing the cost of feature engineering while either maintaining or improving predictive performance. BioAutoML is freely available at https://github.com/Bonidia/BioAutoML.