School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164, USA.
School of Biological Sciences, Center for Reproductive Biology, Washington State University, Pullman, WA, 99164-4236, USA.
BMC Bioinformatics. 2023 Nov 7;24(1):419. doi: 10.1186/s12859-023-05557-w.
The performance of machine learning classification methods relies heavily on the choice of features. In many domains, feature generation is labor-intensive and requires domain knowledge, and feature selection methods do not scale well to high-dimensional datasets. Deep learning has shown success in feature generation but requires large datasets to achieve high classification accuracy. Biology domains typically exhibit both challenges: numerous handcrafted features (high dimensionality) and small amounts of training data (low volume).
A hybrid learning approach is proposed that first trains a deep network on the training data, extracts features from the deep network, and then uses these features to re-express the data for input to a non-deep learning method, which is trained to perform the final classification.
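The three steps described above (train a deep network, extract features from one of its layers, train a non-deep learner on the re-expressed data) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the synthetic two-class data, the single tanh hidden layer, the helper names `train_mlp`, `hidden_features`, and `knn_predict`, and the use of k-nearest neighbours as the stand-in non-deep classifier.

```python
import math
import random


def train_mlp(X, y, hidden=4, epochs=200, lr=0.1, seed=0):
    """Train a one-hidden-layer network (tanh hidden, sigmoid output) with SGD.

    Hypothetical helper: returns only the hidden-layer weights and biases,
    since those are all that feature extraction needs.
    """
    rng = random.Random(seed)
    d = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(W1, b1)]
            z = sum(w * hi for w, hi in zip(W2, h)) + b2
            z = max(-30.0, min(30.0, z))       # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - t                          # d(cross-entropy)/dz
            for j in range(hidden):
                gh = g * W2[j] * (1.0 - h[j] ** 2)  # backprop through tanh
                W2[j] -= lr * g * h[j]
                b1[j] -= lr * gh
                for i in range(d):
                    W1[j][i] -= lr * gh * x[i]
            b2 -= lr * g
    return W1, b1


def hidden_features(X, W1, b1):
    """Re-express each sample as its hidden-layer activations."""
    return [[math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(W1, b1)] for x in X]


def knn_predict(train_F, train_y, query_F, k=3):
    """Stand-in non-deep learner: majority vote among the k nearest neighbours."""
    preds = []
    for q in query_F:
        nearest = sorted(range(len(train_F)),
                         key=lambda i: math.dist(q, train_F[i]))[:k]
        votes = sum(train_y[i] for i in nearest)
        preds.append(1 if 2 * votes > k else 0)
    return preds


# Synthetic low-volume data (assumption): 20 samples, 6 features, 2 classes.
rng = random.Random(1)
X = [[rng.gauss(c, 1.0) for _ in range(6)] for c in (0, 1) for _ in range(10)]
y = [c for c in (0, 1) for _ in range(10)]

# Hold out every fifth sample for testing.
X_train = [x for i, x in enumerate(X) if i % 5]
y_train = [t for i, t in enumerate(y) if i % 5]
X_test = [x for i, x in enumerate(X) if i % 5 == 0]

W1, b1 = train_mlp(X_train, y_train)               # step 1: train the deep network
F_train = hidden_features(X_train, W1, b1)         # step 2: extract features
F_test = hidden_features(X_test, W1, b1)
preds = knn_predict(F_train, y_train, F_test)      # step 3: non-deep classifier
```

In a network with several hidden layers, the same idea applies to whichever layer's activations are chosen as the new representation; the sketch has only one hidden layer, so it does not address how that layer would be selected.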
The approach is systematically evaluated to determine the best layer of the deep network from which to extract features and the training data volume below which the approach is preferred. Results from several domains show that the hybrid approach outperforms standalone deep and non-deep learning methods, especially on low-volume, high-dimensional datasets. The diversity of the datasets further supports the robustness of the approach across domains.
The hybrid approach combines the strengths of the deep and non-deep learning paradigms to achieve high performance on the high-dimensional, low-volume learning tasks that are typical in biology domains.