Martinez Kaitlyn M, Wilding Kristen, Llewellyn Trent R, Jacobsen Daniel E, Montoya Makaela M, Kubicek-Sutherland Jessica Z, Batni Sweta, Manore Carrie, Mukundan Harshini
A-1 Information Systems and Modeling, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
T-6 Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
Sci Rep. 2025 May 13;15(1):16651. doi: 10.1038/s41598-025-00245-6.
The complexity and variability of biological data have promoted the increased use of machine learning methods to understand processes and predict outcomes. These same features complicate reliable, reproducible, interpretable, and responsible use of such methods, resulting in questionable relevance of the derived outcomes. Here we systematically explore challenges associated with applying machine learning to predict and understand biological processes using a well-characterized in vitro experimental system. We evaluated factors that vary while applying machine learning classifiers: (1) type of biochemical signature (transcripts vs. proteins), (2) data curation methods (pre- and post-processing), and (3) choice of machine learning classifier. Using accuracy, generalizability, interpretability, and reproducibility as metrics, we found that the above factors significantly modulate outcomes even within a simple model system. Our results caution against the unregulated use of machine learning methods in the biological sciences, and strongly advocate the need for data standards and validation toolkits for such studies.
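The kind of comparison the abstract describes, varying the pre-processing step and the classifier while tracking accuracy and run-to-run spread, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' pipeline: the data are synthetic, the two toy classifiers (nearest centroid and 1-nearest-neighbour) stand in for whatever models the study used, and all function names are hypothetical. Only the Python standard library is used.

```python
import random
import statistics

def make_data(n_per_class=40, seed=0):
    """Synthetic two-class data (assumption): feature 0 is informative,
    feature 1 is uninformative noise on a much larger scale."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * 2.0, 1.0), rng.gauss(0.0, 50.0)])
            y.append(label)
    return X, y

def identity(train, test):
    """No pre-processing."""
    return train, test

def zscore(train, test):
    """Standardize each feature using training-fold statistics only."""
    cols = list(zip(*train))
    mu = [statistics.fmean(c) for c in cols]
    sd = [statistics.pstdev(c) or 1.0 for c in cols]
    scale = lambda rows: [[(v - m) / s for v, m, s in zip(r, mu, sd)] for r in rows]
    return scale(train), scale(test)

def sqdist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_centroid(Xtr, ytr, Xte):
    """Predict the class whose training mean is closest."""
    cents = {}
    for lab in set(ytr):
        rows = [x for x, l in zip(Xtr, ytr) if l == lab]
        cents[lab] = [statistics.fmean(c) for c in zip(*rows)]
    return [min(cents, key=lambda lab: sqdist(x, cents[lab])) for x in Xte]

def one_nn(Xtr, ytr, Xte):
    """Predict the label of the single closest training sample."""
    return [ytr[min(range(len(Xtr)), key=lambda i: sqdist(x, Xtr[i]))] for x in Xte]

def cv_accuracy(X, y, prep, clf, k=5, seed=0):
    """k-fold cross-validated accuracy; the seed controls the fold split."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held = set(fold)
        tr = [i for i in idx if i not in held]
        Xtr, Xte = prep([X[i] for i in tr], [X[i] for i in fold])
        pred = clf(Xtr, [y[i] for i in tr], Xte)
        correct += sum(p == y[i] for p, i in zip(pred, fold))
    return correct / len(X)

def compare(seeds=range(5)):
    """Mean accuracy and spread across fold splits for every
    (pre-processing, classifier) combination."""
    X, y = make_data()
    results = {}
    for pname, prep in (("raw", identity), ("zscore", zscore)):
        for cname, clf in (("centroid", nearest_centroid), ("1nn", one_nn)):
            accs = [cv_accuracy(X, y, prep, clf, seed=s) for s in seeds]
            results[(pname, cname)] = (statistics.fmean(accs), statistics.pstdev(accs))
    return results

if __name__ == "__main__":
    for (pname, cname), (mean, sd) in compare().items():
        print(f"{pname:7s} {cname:9s} mean={mean:.3f} sd={sd:.3f}")
```

Reporting the spread across fold splits alongside the mean is one simple way to make reproducibility visible; a fuller analysis would also hold out an independent dataset to probe generalizability.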