Department of Automatic Control and Industrial Informatics, Faculty of Automatic Control and Computer Science, University "Politehnica" of Bucharest, Splaiul Independentei nr. 313, Sector 6, Bucuresti, 060042, Romania.
Comput Methods Programs Biomed. 2021 Nov;211:106418. doi: 10.1016/j.cmpb.2021.106418. Epub 2021 Sep 16.
Background and Objective: Detecting differentially expressed genes is an important step in genome-wide analysis and expression profiling. A wide array of algorithms based on statistical approaches is used in today's research. Although these algorithms work, they sometimes mispredict, and no framework is available for measuring their quality. Newer machine learning methods (such as gradient boosting and deep neural networks) have not yet been applied to this problem. The Gene-Bench open-source Python package addresses these issues by providing an evaluation and data handling system for algorithms that detect differentially expressed genes in microarray data. We also provide MIDGET, a new group of algorithms based on state-of-the-art machine learning approaches.

Methods: The Gene-Bench package provides data collected from real experiments, consisting of 73 transcription-factor perturbation experiments with validation data from ChIP-seq experiments and 129 drug perturbation experiments; synthetic data generated with our own method; and three evaluation metrics (Kolmogorov, F1 and AUC/ROC). Besides the data and metrics, Gene-Bench also contains well-known algorithms and a new method for identifying differentially expressed genes, called MIDGET (Machine learning Identification Differential Gene Expression Tool), which uses big-data and machine learning methods. The two new groups of machine learning algorithms provided in our package use extreme gradient boosting and deep neural networks.

Results: The Gene-Bench package is highly flexible, allows fast prototyping and evaluation of new and existing algorithms, and provides multiple new machine learning algorithms (called MIDGET) that perform better on all evaluation metrics than all the other tested alternatives. Everything provided in Gene-Bench is algorithm independent, and although the package is written in Python, users can also plug in algorithms implemented in the R language.

Conclusions: The Gene-Bench package fills a gap in evaluating and benchmarking algorithms for detecting differentially expressed genes. It also provides machine learning methods that perform detection with higher accuracy on all tested metrics. It is available at https://github.com/raduangelescu/GeneBench/ and can be installed directly from the Python Package Index using pip install genebench.
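To make the three evaluation metrics named in the Methods more concrete, the following minimal sketch shows how a detector's output might be scored against a validated gene set. It is not the Gene-Bench API: the function names, the gene identifiers and the scores are hypothetical, and reading "Kolmogorov" as the two-sample Kolmogorov-Smirnov statistic between the score distributions of validated and non-validated genes is an assumption rather than something the abstract states.

```python
# Illustrative sketch only: these helpers are NOT part of the genebench package.
from bisect import bisect_right


def f1_score(predicted, truth):
    """F1 of a predicted gene set against a validated gene set."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)


def _split_scores(scores, truth):
    """Separate per-gene scores into validated (pos) and other (neg) genes."""
    pos = sorted(s for g, s in scores.items() if g in truth)
    neg = sorted(s for g, s in scores.items() if g not in truth)
    if not pos or not neg:
        raise ValueError("need at least one validated and one other gene")
    return pos, neg


def roc_auc(scores, truth):
    """AUC as the chance a validated gene outscores another gene (ties count half)."""
    pos, neg = _split_scores(scores, truth)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(scores, truth):
    """Assumed reading of 'Kolmogorov': largest gap between the two empirical CDFs."""
    pos, neg = _split_scores(scores, truth)
    max_gap = 0.0
    for t in sorted(set(pos + neg)):
        cdf_pos = bisect_right(pos, t) / len(pos)
        cdf_neg = bisect_right(neg, t) / len(neg)
        max_gap = max(max_gap, abs(cdf_pos - cdf_neg))
    return max_gap


# Hypothetical detector output: a differential-expression score per gene.
scores = {"GENE_A": 0.9, "GENE_B": 0.8, "GENE_C": 0.4, "GENE_D": 0.3, "GENE_E": 0.1}
truth = {"GENE_A", "GENE_C"}  # hypothetical validated targets
predicted = {g for g, s in scores.items() if s >= 0.5}  # arbitrary cut-off

print("F1: ", f1_score(predicted, truth))
print("AUC:", roc_auc(scores, truth))
print("KS: ", ks_statistic(scores, truth))
```

F1 depends on the chosen cut-off, whereas the AUC and the Kolmogorov-Smirnov statistic are computed from the full score ranking, which is one reason a benchmark may report all three together.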